URL: https://deepinfra.com/models/text-generation

Models | Machine Learning Inference | DeepInfra


DeepInfra raises $107M Series B to scale the inference cloud (read the announcement)

Browse DeepInfra models:

All categories and models you can try out and use directly on DeepInfra:

featured
GLM-5.2

GLM-5.2 is Z-AI's latest flagship model for long-horizon tasks. It marks a substantial leap in long-horizon task capability over its predecessor GLM-5.1 and, for the first time, delivers that capability on a **solid 1M-token context**.
Priority
fp4
1024k
$0.18 cached, $0.95 in, $3.00 out / 1M
featured
Kimi-K2.7-Code

Kimi K2.7 Code is a coding-focused agentic model built upon Kimi K2.6. With substantial improvements on real-world long-horizon coding tasks, it strengthens end-to-end task completion across complex software engineering workflows while improving token efficiency, reducing thinking-token usage by approximately 30% compared with Kimi K2.6.
Priority
fp4
256k
$0.15 cached, $0.74 in, $3.50 out / 1M
featured
NVIDIA-Nemotron-3-Ultra-550B-A55B

Nemotron 3 Ultra is built for frontier reasoning, orchestration, coding agents, deep research, and complex enterprise workflows. It delivers up to 5x faster inference and up to 30% lower cost for agentic workloads while supporting up to 1M token context.
Priority
256k
$0.10 cached, $0.50 in, $2.20 out / 1M
featured
Nemotron-3-Nano-Omni-30B-A3B-Reasoning

Nemotron 3 Nano Omni is an open multimodal model built on a hybrid Mixture-of-Experts (MoE) architecture, engineered for high efficiency and strong accuracy across image, video, audio, and text inputs. It powers always-on sub-agents for computer use, document intelligence, and audio-video understanding, replacing fragmented vision, speech, and language pipelines with a single unified inference pass.
Priority
bfloat16
256k
$0.20 in, $0.80 out / 1M
featured
DeepSeek-V4-Flash

DeepSeek V4 Flash is an efficiency-focused MoE model with 284B total parameters (13B active) and a 1M-token context window. It's tuned for fast inference and high-throughput use cases while still holding up on reasoning and coding tasks.
Priority
fp4
1024k
$0.02 cached, $0.10 in, $0.20 out / 1M
featured
DeepSeek-V4-Pro

DeepSeek V4 Pro is an MoE model with 1.6T total parameters (49B active) and a 1M-token context window. It's built for advanced reasoning, coding, and long-running agent tasks, and performs well on knowledge, math, and software engineering benchmarks.
Priority
fp4
1024k
$0.10 cached, $1.30 in, $2.60 out / 1M
featured
Kimi-K2.6

Kimi K2.6 is an open-source, native multimodal agentic model that advances practical capabilities in long-horizon coding, coding-driven design, proactive autonomous execution, and swarm-based task orchestration.
Priority
fp4
256k
$0.15 cached, $0.75 in, $3.50 out / 1M
featured
MiMo-V2.5

MiMo-V2.5 is a native omnimodal model with strong agentic capabilities, supporting text, image, video, and audio understanding within a unified architecture. Built upon the MiMo-V2-Flash backbone and extended with dedicated vision and audio encoders, it delivers robust performance across multimodal perception, long-context reasoning, and agentic workflows.
Priority
256k
$0.08 cached, $0.40 in, $2.00 out / 1M
featured
MiMo-V2.5-Pro

MiMo-V2.5-Pro is an open-source Mixture-of-Experts (MoE) language model with 1.02T total parameters and 42B active parameters. It uses the hybrid attention architecture and 3-layer Multi-Token Prediction (MTP) introduced in [MiMo-V2-Flash](https://github.com/XiaomiMiMo/MiMo-V2-Flash).
Priority
fp8
1024k
$0.20 cached, $1.00 in, $3.00 out / 1M
featured
Qwen3.6-35B-A3B

Qwen3.6-35B-A3B is Alibaba's latest flagship Mixture-of-Experts model, with 35B total parameters and only 3B activated per token (256 experts, 8 routed + 1 shared). Built on direct feedback from the community, Qwen3.6 prioritizes stability and real-world utility, offering developers a more intuitive, responsive, and genuinely productive coding experience.
Priority
fp8
256k
$0.15 in, $0.95 out / 1M
featured
GLM-5.1

GLM-5.1 is Z-AI's next-generation flagship model for agentic engineering, with significantly stronger coding capabilities than its predecessor. It achieves state-of-the-art performance on SWE-Bench Pro and leads GLM-5 by a wide margin on NL2Repo (repo generation) and Terminal-Bench 2.0 (real-world terminal tasks).
fp4
198k
$0.205 cached, $1.05 in, $3.50 out / 1M
featured
Qwen3.5-397B-A17B

Qwen3.5-397B-A17B is Alibaba's most capable Qwen3.5 model, a Mixture-of-Experts architecture with 397B total parameters and 17B activated per token. It features a 262K token context window (extensible to 1M with YaRN), thinking/reasoning mode, tool calling with MCP integration, and support for 201 languages. It sets state-of-the-art results on reasoning, coding, math, and multimodal benchmarks.
Priority
fp8
256k
$0.22 cached, $0.45 in, $3.00 out / 1M
featured
gemma-4-26B-A4B-it

An efficient MoE variant of Gemma 4. Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input and generating text output.
Priority
fp8
256k
$0.07 in, $0.34 out / 1M
featured
gemma-4-31B-it

Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input and generating text output.
Priority
fp8
256k
$0.13 in, $0.38 out / 1M
featured
NVIDIA-Nemotron-3-Super-120B-A12B

NVIDIA Nemotron 3 Super is a hybrid Mixture-of-Experts (MoE) model engineered for highest compute efficiency and accuracy in multi-agent applications and specialized agentic systems. It is optimized to run many collaborating agents per application on a single GPU, delivering high accuracy for reasoning, tool use, and instruction following.
Priority
bfloat16
256k
$0.085 in, $0.40 out / 1M
featured
GLM-5

GLM-5 is an advanced, open-source large language model designed for developers tackling the toughest challenges. It excels at long-context reasoning, multi-step tool orchestration, and complex systems engineering, making it the ideal choice for powering sophisticated agents and applications that require high-level cognitive tasks.
fp4
198k
$0.12 cached, $0.60 in, $2.08 out / 1M
featured
MiniMax-M2.5

MiniMax M2.5 is SOTA in coding, agentic tool use and search, office work, and a range of other economically valuable tasks, boasting scores of 80.2% in SWE-Bench Verified, 51.3% in Multi-SWE-Bench, and 76.3% in BrowseComp (with context management).
Priority
fp8
192k
$0.03 cached, $0.15 in, $1.15 out / 1M
featured
Qwen3-Max

The latest flagship model in the Qwen family. State-of-the-art results across a comprehensive suite of benchmarks, including knowledge, reasoning, coding, instruction following, human preference alignment, agent tasks, and multilingual understanding.
Partner
250k
$0.24 cached, $1.20 in, $6.00 out / 1M
featured
Qwen3-Max-Thinking

The latest flagship reasoning model in the Qwen3 family, further enhanced by innovations such as adaptive tool use and advanced test-time scaling techniques.
Partner
250k
$0.24 cached, $1.20 in, $6.00 out / 1M
featured
Kimi-K2.5

Kimi K2.5 is an open-source, native multimodal agentic model built through continual pretraining on approximately 15 trillion mixed visual and text tokens atop Kimi-K2-Base. It seamlessly integrates vision and language understanding with advanced agentic capabilities, instant and thinking modes, as well as conversational and agentic paradigms.
Priority
fp4
256k
$0.07 cached, $0.45 in, $2.25 out / 1M
featured
GLM-4.7-Flash

GLM-4.7-Flash is a 30B-A3B MoE model. As the strongest model in the 30B class, GLM-4.7-Flash offers a new option for lightweight deployment that balances performance and efficiency.
bfloat16
198k
$0.01 cached, $0.06 in, $0.40 out / 1M
featured
DeepSeek-V3.2

DeepSeek-V3.2 is a large language model designed to harmonize high computational efficiency with strong reasoning and agentic tool-use performance. It introduces DeepSeek Sparse Attention (DSA), a fine-grained sparse attention mechanism that reduces training and inference cost while preserving quality in long-context scenarios. A scalable reinforcement learning post-training framework further improves reasoning, with reported performance in the GPT-5 class, and the model has demonstrated gold-medal results on the 2025 IMO and IOI. V3.2 also uses a large-scale agentic task synthesis pipeline to better integrate reasoning into tool-use settings, boosting compliance and generalization in interactive environments.
fp4
160k
$0.13 cached, $0.26 in, $0.38 out / 1M
Seed-1.8

Optimized specifically for multimodal agent scenarios. It features enhanced agent capabilities, upgraded multimodal comprehension, and more flexible context management.
Partner
250k
$0.05 cached, $0.25 in, $2.00 out / 1M
Seed-2.0-code

A coding model optimized for real-world development environments, with reliable tool use in common IDEs such as Claude Code. It delivers strong front-end performance and supports Skills.
Partner
250k
$0.10 cached, $0.50 in, $3.00 out / 1M
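
The price lines in this listing quote USD per 1M tokens for cached input, regular input ("in"), and output ("out"). As a rough illustration of the arithmetic, the Python sketch below parses such a line and estimates the cost of one workload. The function names `parse_price_line` and `estimate_cost` are hypothetical helpers, not part of any DeepInfra API. The billing rule is also an assumption: cached input tokens are charged at the cached rate and the remaining input tokens at the "in" rate.

```python
import re


def parse_price_line(line: str) -> dict:
    """Parse a listing line like '$0.02 cached, $0.10 in, $0.20 out / 1M'.

    Returns a dict of USD prices per 1M tokens keyed by 'cached', 'in', 'out'.
    Lines without a cached price simply omit the 'cached' key.
    """
    prices = {}
    for amount, kind in re.findall(r"\$(\d+(?:\.\d+)?)\s+(cached|in|out)\b", line):
        prices[kind] = float(amount)
    return prices


def estimate_cost(prices: dict, input_tokens: int, output_tokens: int,
                  cached_tokens: int = 0) -> float:
    """Estimate the USD cost of a workload from per-1M-token prices.

    Assumption (not stated in the listing): cached_tokens is the part of
    input_tokens served from cache and billed at the cached rate; if no
    cached rate is listed, those tokens fall back to the regular input rate.
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    cached_rate = prices.get("cached", prices["in"])
    fresh_tokens = input_tokens - cached_tokens
    total = (fresh_tokens * prices["in"]
             + cached_tokens * cached_rate
             + output_tokens * prices["out"])
    return total / 1_000_000


# Example with the DeepSeek-V4-Flash line from the listing above.
flash = parse_price_line("$0.02 cached, $0.10 in, $0.20 out / 1M")
cost = estimate_cost(flash, input_tokens=2_000_000, output_tokens=1_000_000,
                     cached_tokens=500_000)
print(f"Estimated cost: ${cost:.2f}")
```

Under these assumptions, 1.5M fresh input tokens, 0.5M cached input tokens, and 1M output tokens on DeepSeek-V4-Flash come to roughly $0.36. Actual billing may differ, so treat the result as an estimate only.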
Built With Love in Palo Alto

© 2026 DeepInfra. All rights reserved.