Inference Pricing
Last verified 25 Jun 2026
Inference provides a single control plane for managing inference workflows. It includes a Model Catalog where you can view available foundation models, including both DigitalOcean-hosted and third-party commercial models, compare model capabilities and pricing, use routing to match inference requests to the best-fit model, and run inference using serverless or dedicated deployments.
Inference has a usage-based pricing model, so costs scale with your actual usage.
Bring Your Own Models (BYOM)
BYOM model weights are stored in a service-managed, non-accessible Spaces location, and are billed at $5.00 per month. We do not charge you for browsing or managing imported models in Model Catalog. Costs apply only for storing model weights and for using those models with other paid features, such as dedicated inference deployments.
Model Playground
Usage is charged at the same rate as serverless inference.
Serverless Inference
Serverless inference is billed by DigitalOcean for both open-source and commercial models. Prices align with each provider’s published rates for transparency.
The following shows pricing for foundation models available through serverless inference.
Dedicated Inference
Dedicated Inference is billed per GPU-hour based on the GPU you use.
| GPU | Price |
|---|---|
| AMD MI300X | $2.59 per hour |
| AMD MI300X (8x) | $20.70 per hour |
| AMD MI325X | $2.98 per hour |
| AMD MI325X (8x) | $23.82 per hour |
| AMD MI350X | $6.89 per hour |
| NVIDIA B300 | $10.39 per hour |
| NVIDIA B300 (8x) | $83.10 per hour |
| NVIDIA H100 | $4.41 per hour |
| NVIDIA H100 (8x) | $30.32 per hour |
| NVIDIA H200 | $4.47 per hour |
| NVIDIA H200 (8x) | $35.78 per hour |
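As a rough illustration of per-GPU-hour billing, the sketch below multiplies an hourly rate from the table by a number of hours. The function name `dedicated_cost` is hypothetical, and the sketch assumes partial hours are prorated and the result is rounded to cents; the source does not state the billing granularity.

```python
from decimal import Decimal

# Hourly prices copied from the table above (USD per hour).
GPU_HOURLY_USD = {
    "AMD MI300X": Decimal("2.59"),
    "AMD MI300X (8x)": Decimal("20.70"),
    "AMD MI325X": Decimal("2.98"),
    "AMD MI325X (8x)": Decimal("23.82"),
    "AMD MI350X": Decimal("6.89"),
    "NVIDIA B300": Decimal("10.39"),
    "NVIDIA B300 (8x)": Decimal("83.10"),
    "NVIDIA H100": Decimal("4.41"),
    "NVIDIA H100 (8x)": Decimal("30.32"),
    "NVIDIA H200": Decimal("4.47"),
    "NVIDIA H200 (8x)": Decimal("35.78"),
}


def dedicated_cost(gpu, hours):
    """Estimate the cost of one deployment on `gpu` running for `hours`.

    Assumption: hours are prorated and the total is rounded to cents.
    """
    total = GPU_HOURLY_USD[gpu] * Decimal(str(hours))
    return total.quantize(Decimal("0.01"))


# One day on a single H100.
print(dedicated_cost("NVIDIA H100", 24))  # 105.84
```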
Batch Inference
Batch inference is charged at up to a 50% discount on OpenAI and Anthropic models.
You are only charged for completed requests. If a batch job fails, is blocked by guardrails, or expires partway through, requests that were not processed are not charged.
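The completed-requests rule can be sketched as follows. Everything here is illustrative: the `status` values, the single blended token rate, and the `discount` default are assumptions, since actual batch discounts are "up to" 50% and real models price input and output tokens separately.

```python
from decimal import Decimal


def batch_cost(requests, usd_per_million_tokens, discount=Decimal("0.50")):
    """Estimate a batch job's cost, counting completed requests only.

    `requests` is a list of dicts with hypothetical keys "status" and "tokens".
    `discount` is the fraction taken off the standard rate (assumed, not fixed).
    """
    billable_tokens = sum(
        r["tokens"] for r in requests if r["status"] == "completed"
    )
    full_price = Decimal(billable_tokens) / Decimal(1_000_000) * usd_per_million_tokens
    return full_price * (Decimal(1) - discount)


job = [
    {"status": "completed", "tokens": 600_000},
    {"status": "failed", "tokens": 400_000},  # not processed, so not billed
    {"status": "completed", "tokens": 400_000},
]
# Hypothetical rate of $2.00 per million tokens.
print(batch_cost(job, Decimal("2.00")))
```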
Inference Router (public preview)
Inference Router is available in public preview and enabled for all users. You can contact support for questions or assistance.
There is no additional cost to using Inference Router during public preview. Using inference routing forwards requests to foundation models for serverless inference and dedicated inference. You are billed for the models that serve each request.
Tools Usage
Knowledge base retrieval, DigitalOcean MCP servers, and Anthropic- and OpenAI-only tools, such as tool search and computer use, do not incur additional charges other than the standard per-token inference costs.
The following tools incur charges in addition to the standard per-token inference costs:
- Web search: $10 per 1,000 requests, not charged when using Anthropic models
- Web fetch: $3 per 1,000 requests, not charged when using Anthropic models
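A minimal sketch of the tool surcharges listed above. The function name and the `provider` string are hypothetical, and the sketch assumes charges are prorated per request rather than billed in blocks of 1,000, which the source does not specify.

```python
from decimal import Decimal

# Surcharges from the list above (USD per 1,000 requests).
USD_PER_1000_REQUESTS = {
    "web_search": Decimal("10"),
    "web_fetch": Decimal("3"),
}


def tool_surcharge(tool, requests, provider):
    """Estimate the tool charge on top of standard per-token inference costs."""
    if provider.lower() == "anthropic":
        # Web search and web fetch are not charged with Anthropic models.
        return Decimal("0")
    return USD_PER_1000_REQUESTS[tool] * Decimal(requests) / Decimal(1000)


print(tool_surcharge("web_search", 2500, "openai"))
print(tool_surcharge("web_fetch", 2500, "anthropic"))
```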
Model Evaluations (public preview)
Model evaluations for candidate models deployed on Serverless Inference, and for judge models, are charged at the same token rates as serverless inference.
Knowledge Bases
Knowledge base pricing is shown per million tokens, but billing is calculated per thousand tokens.
You’re billed for indexing and retrieval tokens, reranking tokens (if reranking is enabled), and storage:

- Tokens used for indexing and retrieval query vectorization: We charge for tokens used to generate embeddings during indexing and to vectorize user queries during retrieval. Both use the same embeddings model pricing.

  Indexing pricing is the same for manual and auto-indexing. Indexing charges apply only when changes are detected, such as new, updated, or deleted files or URLs. If auto-indexing is paused or no changes are found, there are no indexing charges.

  Retrieval requests sent through an MCP server are billed the same as retrieval requests sent directly to the knowledge base retrieve endpoint. This includes the tokens used to vectorize the retrieval query with the selected embeddings model.

  For example, a 10 MB dataset is about 3 million tokens, and a 1 GB dataset is about 250 million tokens. Actual costs depend on the embeddings model:

  | Model | Price |
  |---|---|
  | all-mini-lm-l6-v2 | $0.009 per 1M input tokens |
  | multi-qa-mpnet-base-dot-v1 | $0.009 per 1M input tokens |
  | gte-large-en-v1.5 | $0.09 per 1M input tokens |
  | Qwen3 Embedding 0.6B | $0.04 per 1,000,000 tokens |
  | BGE-M3 | $0.02 per 1,000,000 tokens |
  | E5 Large V2 | $0.02 per 1,000,000 tokens |

  One token is roughly four characters (approximately 75 words per 100 tokens). Non-Latin scripts, emojis, or binary data may increase token counts.

- Reranking tokens: If reranking is enabled, tokens used to rerank results are billed based on the selected reranking model. For supported reranking models, see available reranking models.

  | Model | Price |
  |---|---|
  | BGE Reranker v2 m3 | $0.01 per 1M reranking tokens |

- Storage: Embeddings are stored in OpenSearch. See OpenSearch pricing.
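To make the embedding token arithmetic concrete, here is a small sketch that estimates tokens from character counts (using the four-characters-per-token rule of thumb) and prices them with the embeddings model rates listed above. The function names are hypothetical, and the sketch ignores any rounding that per-thousand-token billing may introduce, as well as reranking and OpenSearch storage costs.

```python
from decimal import Decimal

# Embeddings model prices listed above (USD per 1,000,000 tokens).
EMBEDDING_USD_PER_MILLION = {
    "all-mini-lm-l6-v2": Decimal("0.009"),
    "multi-qa-mpnet-base-dot-v1": Decimal("0.009"),
    "gte-large-en-v1.5": Decimal("0.09"),
    "Qwen3 Embedding 0.6B": Decimal("0.04"),
    "BGE-M3": Decimal("0.02"),
    "E5 Large V2": Decimal("0.02"),
}


def estimate_tokens(char_count):
    """Rough token count: about four characters per token, rounded up."""
    return -(-char_count // 4)  # ceiling division


def embedding_cost(tokens, model):
    """Estimate the embedding charge for `tokens` with the given model."""
    return Decimal(tokens) / Decimal(1_000_000) * EMBEDDING_USD_PER_MILLION[model]


# A dataset of roughly 250 million tokens (about 1 GB) on gte-large-en-v1.5.
print(embedding_cost(250_000_000, "gte-large-en-v1.5"))
```

The same function applies to retrieval queries, since query vectorization uses the same embeddings model pricing as indexing.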
Chunking has no separate charge. Its cost shows up indirectly through embedding token usage, your OpenSearch database, and the selected embeddings model.
Chunking strategy cost depends on how many tokens the strategy embeds and returns:
- Section-based and fixed-length chunking are the most cost-efficient because they use simple splitting and have predictable token usage.
- Semantic chunking costs more because it uses the embeddings model to detect semantic boundaries and embed final chunks, often resulting in 1.5 to 3 times more indexing tokens.
- Hierarchical chunking slightly increases indexing cost by creating parent and child embeddings. It can also increase retrieval cost because agents receive both child and parent chunks for each lookup.
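The 1.5 to 3 times multiplier for semantic chunking can be turned into a cost range. This is only a sketch: the function name is hypothetical, the baseline is whatever a simple splitting strategy would have embedded, and actual multipliers depend on your data.

```python
from decimal import Decimal


def semantic_indexing_range(baseline_tokens, usd_per_million):
    """Return ((low_tokens, low_cost), (high_tokens, high_cost)).

    Applies the 1.5x and 3x bounds quoted above to a baseline token count.
    """
    results = []
    for factor in (Decimal("1.5"), Decimal("3")):
        tokens = Decimal(baseline_tokens) * factor
        cost = tokens / Decimal(1_000_000) * usd_per_million
        results.append((int(tokens), cost))
    return tuple(results)


# A 3-million-token baseline priced at $0.09 per million tokens.
low, high = semantic_indexing_range(3_000_000, Decimal("0.09"))
print(low, high)
```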
Changing your chunking strategy or configuration requires re-indexing the affected data source, which consumes additional tokens. For guidance on chunking configurations and best practices, see our chunking parameters reference and chunking best practices.
If you use RAG Playground, answer generation is billed separately based on the selected serverless inference model. Free tokens for RAG Playground are not separate; they are shared with Model Playground.
Agent Platform
Agent creation is free. We charge for model usage and for additional features like knowledge bases, guardrails, and log stream insights. We display prices per million tokens and bill per thousand tokens for accuracy.
Model usage is billed by DigitalOcean. You are charged for all input and output tokens processed by the agent at the same token rates as serverless inference. Token usage depends on factors such as input length, agent instructions, attached knowledge bases, and configuration settings. To optimize usage, test your agents and adjust their parameters.
Agent Guardrails
Charges apply for all tokens processed through agent guardrails:
| Guardrail | Price |
|---|---|
| Content Moderation | $0.20 per 1,000,000 tokens |
| Jailbreak Detection | $0.20 per 1,000,000 tokens |
| Sensitive Data Detection | $0.34 per 1,000,000 tokens |
Costs are per token. Creating, editing, or duplicating guardrails has no additional cost.
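As an illustration of the guardrail rates, the sketch below totals the charge for a token volume passing through one or more guardrails. The function name is hypothetical, and it assumes each attached guardrail is billed independently on the same tokens, which the source implies but does not spell out.

```python
from decimal import Decimal

# Guardrail prices from the table above (USD per 1,000,000 tokens).
GUARDRAIL_USD_PER_MILLION = {
    "Content Moderation": Decimal("0.20"),
    "Jailbreak Detection": Decimal("0.20"),
    "Sensitive Data Detection": Decimal("0.34"),
}


def guardrail_cost(tokens, guardrails):
    """Estimate guardrail charges for `tokens` processed by each listed guardrail."""
    combined_rate = sum(
        (GUARDRAIL_USD_PER_MILLION[name] for name in guardrails), Decimal("0")
    )
    return Decimal(tokens) / Decimal(1_000_000) * combined_rate


print(guardrail_cost(5_000_000, ["Content Moderation", "Sensitive Data Detection"]))
```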
Functions
If you attach DigitalOcean Functions to your agent, you are billed at functions pricing.
Agent Evaluations
Agent evaluations are charged by token usage at the same rates as model usage.
Log Stream Insights
Log Stream Insights uses a third-party model to analyze agent trace data. You are charged per token:
| Tokens | Price |
|---|---|
| Input | $1.10 per 1,000,000 tokens |
| Output | $4.40 per 1,000,000 tokens |
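A short sketch of how the input and output rates combine for Log Stream Insights. The function name is hypothetical, and the token counts in the example are made up for illustration.

```python
from decimal import Decimal

# Log Stream Insights rates from the table above (USD per 1,000,000 tokens).
INSIGHTS_INPUT_USD_PER_MILLION = Decimal("1.10")
INSIGHTS_OUTPUT_USD_PER_MILLION = Decimal("4.40")


def insights_cost(input_tokens, output_tokens):
    """Estimate the charge for one period of insight generation."""
    total = (
        Decimal(input_tokens) * INSIGHTS_INPUT_USD_PER_MILLION
        + Decimal(output_tokens) * INSIGHTS_OUTPUT_USD_PER_MILLION
    )
    return total / Decimal(1_000_000)


# Hypothetical volume: 2M input tokens of trace data, 500k output tokens.
print(insights_cost(2_000_000, 500_000))
```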
Agent Development Kit (public preview)
You are not charged for using the Agent Development Kit during public preview. However, you are billed for other DigitalOcean Inference features you use with your agent deployment:
- Model usage with the Agent Development Kit (ADK) is billed. If you use a DigitalOcean-hosted model, usage is charged to the corresponding model keys.
- For agent evaluations, token usage is charged to the agent model keys. For example, if your agent uses a serverless inference endpoint key, any token usage is charged to that key. If the agent uses a third-party model key, or a key to a model not hosted on DigitalOcean, you are charged by the hosting provider.
- If you enable Log Stream Insights for your agent deployment, you are charged for tokens when new insights are generated.
