We use essential cookies to make our site work. With your consent, we may also use non-essential cookies to improve user experience and analyze website traffic…

DeepInfra raises $107M Series B to scale the inference cloud — read the announcement

Automatic Speech Recognition Embeddings Reranker Text Generation Text To Image Text To Music Text To Speech Text To Video World Model Zero Shot Image Classification

Docs

Pricing

Browse deepinfra models:

All categories and models you can try out and directly use in deepinfra:

automatic-speech-recognition

world-model

zero-shot-image-classification

featured

zai-org/

GLM-5.2

👁 zai-org/GLM-5.2 cover image

GLM-5.2 is Z-AI's latest flagship model for long-horizon tasks. It marks a substantial leap in long-horizon task capability over its predecessor GLM-5.1 and, for the first time, delivers that capability on a **solid 1M-token context**.

Priority

fp4

1024k

$0.18 cached, $0.95 in, $3.00 out / 1M

featured

text-generation

👁 moonshotai logo

moonshotai/

Kimi-K2.7-Code

👁 moonshotai/Kimi-K2.7-Code cover image

Kimi K2.7 Code is a coding-focused agentic model built upon Kimi K2.6. With substantial improvements on real-world long-horizon coding tasks, it strengthens end-to-end task completion across complex software engineering workflows while improving token efficiency, reducing thinking-token usage by approximately 30% compared with Kimi K2.6.

Priority

fp4

256k

$0.15 cached, $0.74 in, $3.50 out / 1M

featured

text-generation

👁 nvidia logo

nvidia/

NVIDIA-Nemotron-3-Ultra-550B-A55B

👁 nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B cover image

Nemotron 3 Ultra is built for, frontier reasoning, orchestration, coding agents, deep research, and complex enterprise workflows. It delivers up to 5x faster inference and up to 30% lower cost for agentic workloads while supporting up to 1M token context.

Priority

256k

$0.10 cached, $0.50 in, $2.20 out / 1M

featured

text-generation

👁 nvidia logo

nvidia/

Nemotron-3-Nano-Omni-30B-A3B-Reasoning

👁 nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning cover image

Nemotron 3 Nano Omni is an open multimodal model built on a hybrid Mixture-of-Experts (MoE) architecture, engineered for high efficiency and strong accuracy across image, video, audio, and text inputs. It powers always-on sub-agents for computer use, document intelligence, and audio-video understanding—replacing fragmented vision, speech, and language pipelines with a single unified inference pass.

Priority

bfloat16

256k

$0.20 in, $0.80 out / 1M

featured

text-generation

👁 deepseek-ai logo

deepseek-ai/

DeepSeek-V4-Flash

👁 deepseek-ai/DeepSeek-V4-Flash cover image

DeepSeek V4 Flash is an efficiency-focused MoE model with 284B total parameters (13B active) and a 1M-token context window. It's tuned for fast inference and high-throughput use cases while still holding up on reasoning and coding tasks.

Priority

fp4

1024k

$0.02 cached, $0.10 in, $0.20 out / 1M

featured

text-generation

👁 deepseek-ai logo

deepseek-ai/

DeepSeek-V4-Pro

👁 deepseek-ai/DeepSeek-V4-Pro cover image

DeepSeek V4 Pro is an MoE model with 1.6T total parameters (49B active) and a 1M-token context window. It's built for advanced reasoning, coding, and long-running agent tasks, and performs well on knowledge, math, and software engineering benchmarks.

Priority

fp4

1024k

$0.10 cached, $1.30 in, $2.60 out / 1M

featured

text-generation

👁 moonshotai logo

moonshotai/

Kimi-K2.6

👁 moonshotai/Kimi-K2.6 cover image

Kimi K2.6 is an open-source, native multimodal agentic model that advances practical capabilities in long-horizon coding, coding-driven design, proactive autonomous execution, and swarm-based task orchestration.

Priority

fp4

256k

$0.15 cached, $0.75 in, $3.50 out / 1M

featured

text-generation

👁 XiaomiMiMo logo

XiaomiMiMo/

MiMo-V2.5

👁 XiaomiMiMo/MiMo-V2.5 cover image

MiMo-V2.5 is a native omnimodal model with strong agentic capabilities, supporting text, image, video, and audio understanding within a unified architecture. Built upon the MiMo-V2-Flash backbone and extended with dedicated vision and audio encoders, it delivers robust performance across multimodal perception, long-context reasoning, and agentic workflows.

Priority

256k

$0.08 cached, $0.40 in, $2.00 out / 1M

featured

text-generation

👁 XiaomiMiMo logo

XiaomiMiMo/

MiMo-V2.5-Pro

👁 XiaomiMiMo/MiMo-V2.5-Pro cover image

MiMo-V2.5-Pro is an open-source Mixture-of-Experts (MoE) language model with 1.02T total parameters and 42B active parameters. It utilizes the hybrid attention architecture and 3-layers Multi-Token Prediction (MTP) introduced in [MiMo-V2-Flash](https://github.com/XiaomiMiMo/MiMo-V2-Flash).

Priority

fp8

1024k

$0.20 cached, $1.00 in, $3.00 out / 1M

featured

text-generation

👁 Qwen logo

Qwen/

Qwen3.6-35B-A3B

👁 Qwen/Qwen3.6-35B-A3B cover image

Qwen3.6-35B-A3B is Alibaba's latest flagship Mixture-of-Experts model, with 35B total parameters and only 3B activated per token (256 experts, 8 routed + 1 shared). Built on direct feedback from the community, Qwen3.6 prioritizes stability and real-world utility, offering developers a more intuitive, responsive, and genuinely productive coding experience.

Priority

fp8

256k

$0.15 in, $0.95 out / 1M

featured

text-generation

👁 zai-org logo

zai-org/

GLM-5.1

👁 zai-org/GLM-5.1 cover image

GLM-5.1 is Z-AI's next-generation flagship model for agentic engineering, with significantly stronger coding capabilities than its predecessor. It achieves state-of-the-art performance on SWE-Bench Pro and leads GLM-5 by a wide margin on NL2Repo (repo generation) and Terminal-Bench 2.0 (real-world terminal tasks).

fp4

198k

$0.205 cached, $1.05 in, $3.50 out / 1M

featured

text-generation

👁 Qwen logo

Qwen/

Qwen3.5-397B-A17B

👁 Qwen/Qwen3.5-397B-A17B cover image

Qwen3.5-397B-A17B is Alibaba's most capable Qwen3.5 model, a Mixture-of-Experts architecture with 397B total parameters and 17B activated per token. It features a 262K token context window (extensible to 1M with YaRN), thinking/reasoning mode, tool calling with MCP integration, and support for 201 languages. Sets state-of-the-art results on reasoning, coding, math, and multimodal benchmarks.

Priority

fp8

256k

$0.22 cached, $0.45 in, $3.00 out / 1M

featured

text-generation

👁 google logo

google/

gemma-4-26B-A4B-it

👁 google/gemma-4-26B-A4B-it cover image

Efficient, MoE variant of Gemma 4. Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input and generating text output.

Priority

fp8

256k

$0.07 in, $0.34 out / 1M

featured

text-generation

👁 google logo

google/

gemma-4-31B-it

👁 google/gemma-4-31B-it cover image

Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input and generating text output.

Priority

fp8

256k

$0.13 in, $0.38 out / 1M

featured

text-generation

👁 nvidia logo

nvidia/

NVIDIA-Nemotron-3-Super-120B-A12B

👁 nvidia/NVIDIA-Nemotron-3-Super-120B-A12B cover image

NVIDIA Nemotron 3 Super is a hybrid Mixture-of-Experts (MoE) model engineered for highest compute efficiency and accuracy in multi-agent applications and specialized agentic systems. It is optimized to run many collaborating agents per application on a single GPU, delivering high accuracy for reasoning, tool use, and instruction following.

Priority

bfloat16

256k

$0.085 in, $0.40 out / 1M

featured

text-generation

👁 zai-org logo

zai-org/

GLM-5

👁 zai-org/GLM-5 cover image

GLM-5 is an advanced, open-source large language model designed for developers tackling the toughest challenges. It excels at long-context reasoning, multi-step tool orchestration, and complex systems engineering, making it the ideal choice for powering sophisticated agents and applications that require high-level cognitive tasks.

fp4

198k

$0.12 cached, $0.60 in, $2.08 out / 1M

featured

text-generation

👁 MiniMaxAI logo

MiniMaxAI/

MiniMax-M2.5

👁 MiniMaxAI/MiniMax-M2.5 cover image

MiniMax M2.5 is SOTA in coding, agentic tool use and search, office work, and a range of other economically valuable tasks, boasting scores of 80.2% in SWE-Bench Verified, 51.3% in Multi-SWE-Bench, and 76.3% in BrowseComp (with context management).

Priority

fp8

192k

$0.03 cached, $0.15 in, $1.15 out / 1M

featured

text-generation

👁 Qwen logo

Qwen/

Qwen3-Max

👁 Qwen/Qwen3-Max cover image

The latest flagship model in the Qwen family. State-of-the-art results across a comprehensive suite of benchmarks — including knowledge, reasoning, coding, instruction following, human preference alignment, agent tasks, and multilingual understanding.

Partner

250k

$1.20 in $6.00 out $0.24 cached / 1M tokens

featured

text-generation

👁 Qwen logo

Qwen/

Qwen3-Max-Thinking

👁 Qwen/Qwen3-Max-Thinking cover image

The latest flagship reasoning model in the Qwen3 family. Further enhanced by multiple innovations like adaptive tool-use and advanced test-time scaling techniques

Partner

250k

$1.20 in $6.00 out $0.24 cached / 1M tokens

featured

text-generation

👁 moonshotai logo

moonshotai/

Kimi-K2.5

👁 moonshotai/Kimi-K2.5 cover image

Kimi K2.5 is an open-source, native multimodal agentic model built through continual pretraining on approximately 15 trillion mixed visual and text tokens atop Kimi-K2-Base. It seamlessly integrates vision and language understanding with advanced agentic capabilities, instant and thinking modes, as well as conversational and agentic paradigms.

Priority

fp4

256k

$0.07 cached, $0.45 in, $2.25 out / 1M

featured

text-generation

👁 zai-org logo

zai-org/

GLM-4.7-Flash

👁 zai-org/GLM-4.7-Flash cover image

GLM-4.7-Flash is a 30B-A3B MoE model. As the strongest model in the 30B class, GLM-4.7-Flash offers a new option for lightweight deployment that balances performance and efficiency.

bfloat16

198k

$0.01 cached, $0.06 in, $0.40 out / 1M

featured

text-generation

👁 deepseek-ai logo

deepseek-ai/

DeepSeek-V3.2

👁 deepseek-ai/DeepSeek-V3.2 cover image

DeepSeek-V3.2 is a large language model designed to harmonize high computational efficiency with strong reasoning and agentic tool-use performance. It introduces DeepSeek Sparse Attention (DSA), a fine-grained sparse attention mechanism that reduces training and inference cost while preserving quality in long-context scenarios. A scalable reinforcement learning post-training framework further improves reasoning, with reported performance in the GPT-5 class, and the model has demonstrated gold-medal results on the 2025 IMO and IOI. V3.2 also uses a large-scale agentic task synthesis pipeline to better integrate reasoning into tool-use settings, boosting compliance and generalization in interactive environments.

fp4

160k

$0.13 cached, $0.26 in, $0.38 out / 1M

text-generation

👁 ByteDance logo

ByteDance/

Seed-1.8

👁 ByteDance/Seed-1.8 cover image

Optimized specifically for multimodal agent scenarios. It features enhanced agent capabilities, upgraded multimodal comprehension, and more flexible context management.

Partner

250k

$0.05 cached, $0.25 in, $2.00 out / 1M

text-generation

👁 ByteDance logo

ByteDance/

Seed-2.0-code

👁 ByteDance/Seed-2.0-code cover image

A coding model optimized for real-world development environments, with reliable tool use in common IDEs such as Claude Code. It delivers strong front-end performance and supports Skills.

Partner

250k

$0.10 cached, $0.50 in, $3.00 out / 1M

👁 Footer Logo

👁 SOC 2 Certified
👁 ISO 27001 Certified

Have questions or need a custom solution?

Company