AI-Generated Summary
- The Mistral 3 family includes a 675B-parameter sparse multimodal and multilingual mixture-of-experts (MoE) model and a suite of dense Ministral 3 models (3B, 8B, 14B parameters; Base, Instruct, Reasoning variants), all trained on NVIDIA Hopper GPUs and optimized for deployment across a broad range of NVIDIA hardware, including GB200 NVL72, DGX Spark, RTX, and Jetson.
- Mistral Large 3 achieves top-tier performance and energy efficiency on NVIDIA GB200 NVL72 through NVIDIA TensorRT-LLM Wide Expert Parallelism, NVFP4 quantization for low-precision inference (with optimized support in SGLang, TensorRT-LLM, and vLLM), and distributed inference frameworks like NVIDIA Dynamo, while offering seamless integration with open-source frameworks (vLLM, SGLang, Llama.cpp, Ollama).
- Advanced quantization via NVFP4, collaborative optimizations with open-source projects, and upcoming features like multitoken prediction (speculative decoding with EAGLE-3) further drive cost, performance, and scalability, providing developers and enterprises with production-ready deployment options (including NVIDIA NIM microservices, Hugging Face, and direct API access) from edge to cloud.
AI-generated content may summarize information incompletely. Verify important information. Learn more
The new Mistral 3 open model family delivers industry-leading accuracy, efficiency, and customization capabilities for developers and enterprises. Optimized from NVIDIA GB200 NVL72 to edge platforms, Mistral 3 includes:
- One large state-of-the-art sparse multimodal and multilingual mixture of experts (MoE) model with a total parameter count of 675B
- A suite of small, dense high-performance models (called Ministral 3) of sizes 3B, 8B, and 14B, each with Base, Instruct, and Reasoning variants (nine models total)
All the models were trained on NVIDIA Hopper GPUs and are now available through Mistral AI on Hugging Face. Developers can choose from a variety of options for deploying these models on different NVIDIA GPUs with different model precision formats and open source framework compatibility (Table 1).
| Mistral Large 3 | Ministral-3-14B | Ministral-3-8B | Ministral-3-3B | |
| Total parameters | 675B | 14B | 8B | 3B |
| Active parameters | 41B | 14B | 8B | 3B |
| Context window | 256K | 256K | 256K | 256K |
| Base | – | BF16 | BF16 | BF16 |
| Instruct | – | Q4_K_M, FP8, BF16 | Q4_K_M, FP8, BF16 | Q4_K_M, FP8, BF16 |
| Reasoning | Q4_K_M, NVFP4, FP8 | Q4_K_M, BF16 | Q4_K_M, BF16 | Q4_K_M, BF16 |
| Frameworks | ||||
| vLLM | ✔ | ✔ | ✔ | ✔ |
| SGLang | ✔ | – | – | – |
| TensorRT-LLM | ✔ | – | – | – |
| Llama.cpp | – | ✔ | ✔ | ✔ |
| Ollama | – | ✔ | ✔ | ✔ |
| NVIDIA hardware | ||||
| GB200 NVL72 | ✔ | ✔ | ✔ | ✔ |
| Dynamo | ✔ | ✔ | ✔ | ✔ |
| DGX Spark | ✔ | ✔ | ✔ | ✔ |
| RTX | – | ✔ | ✔ | ✔ |
| Jetson | – | ✔ | ✔ | ✔ |
Mistral Large 3 delivers best-in-class performance on NVIDIA GB200 NVL72
NVIDIA-accelerated Mistral Large 3 achieves best-in-class performance on NVIDIA GB200 NVL72 by leveraging a comprehensive stack of optimizations tailored for large state-of-the-art MoEs. Figure 1 shows the performance Pareto frontiers for GB200 NVL72 and NVIDIA H200 across the interactivity range.
Where production AI systems must deliver both strong user experience (UX) and cost-efficient scale, GB200 provides up to 10x higher performance than the previous-generation H200, exceeding 5,000,000 tokens per second per megawatt (MW) at 40 tokens per second per user.
This generational gain translates to better UX, lower per-token cost, and higher energy efficiency for the new model. The gain is primarily driven by the following components of the inference optimization stack:
- NVIDIA TensorRT-LLM Wide Expert Parallelism (Wide-EP) provides optimized MoE GroupGEMM kernels, expert distribution and load balancing, and expert scheduling to fully exploit the NVL72 coherent memory domain. Of particular interest is how resilient this Wide-EP feature set is to architectural variations across large MoEs. This enables a model such as Mistral Large 3 (with roughly half as many experts per layer (128) as DeepSeek-R1) to still realize the high-bandwidth, low-latency, non-blocking benefits of the NVIDIA NVLink fabric.
- Low-precision inference that maintains efficiency and accuracy has been achieved using NVFP4, with support from SGLang, TensorRT-LLM, and vLLM.
- Mistral Large 3 relies on NVIDIA Dynamo, a low latency distributed inference framework, to rate-match and disaggregate the prefill and decode phases of inference. This in turn boosts performance for long-context workloads, such as 8K/1K configurations (Figure 1).
As with all models, upcoming performance optimizations—such as speculative decoding with multitoken prediction (MTP) and EAGLE-3—are expected to push performance further, unlocking even more benefits from this new model.
NVFP4 quantization
For Mistral Large 3, developers can deploy a compute-optimized NVFP4 checkpoint that was quantized offline using the open-source llm-compressor library. This allows for reducing compute and memory costs while maintaining accuracy, by leveraging the NVFP4 higher-precision FP8 scaling factors and finer-grained block scaling to control quantization error.
The recipe targets only the MoE weights while keeping all other components at their original checkpoint precision. Because NVFP4 is native to Blackwell, this variant deploys seamlessly on GB200 NVL72. NVFP4 FP8-scale factors and fine-grained block scaling keep quantization error low, delivering lower compute and memory cost with minimal accuracy loss.
Open source inference
These open weight models can be used with your open source inference framework of choice. TensorRT-LLM leverages optimizations for large MoE models to boost performance on GB200 NVL72 systems. To get started, you can use the TensorRT-LLM preconfigured Docker container.
NVIDIA collaborated with vLLM to expand support for kernel integrations for speculative decoding (EAGLE), NVIDIA Blackwell, disaggregation, and expanded parallelism. To get started, you can deploy the launchable that uses vLLM on NVIDIA cloud GPUs. To see the boilerplate code for serving the model and sample API calls for common use cases, check out Running Mistral Large 3 675B Instruct with vLLM on NVIDIA GPUs.
Figure 2 shows the range of GPUs available in the NVIDIA build platform where you can deploy Mistral Large 3 and Ministral 3. You can select the appropriate GPU size and configuration for your needs.
NVIDIA also collaborated with SGLang to create an implementation of Mistral Large 3 with disaggregation and speculative decoding. Try it out today by deploying the launchable that uses SGLang on NVIDIA cloud GPUs.
Ministral 3 models deliver speed, versatility, and accuracy
The small, dense high performance Ministral 3 models are for edge deployment. Offering flexibility for a variety of needs, they come in three parameter sizes—3B, 8B, and 14B—each with Base, Instruct, and Reasoning variants. You can try the models on edge platforms like NVIDIA GeForce RTX AI PC, NVIDIA DGX Spark, and NVIDIA Jetson.
When developing locally, you still get the benefit of NVIDIA acceleration. NVIDIA collaborated with Ollama and llama.cpp for faster iteration, lower latency, and greater data privacy. You can expect fast inferencing at up to 385 tokens per second on the NVIDIA RTX 5090 GPU with the Ministral-3B variants. Get started with Llama.cpp and Ollama.
For Ministral-3-3B-Instruct, Jetson developers can use the vLLM container on NVIDIA Jetson Thor to achieve 52 tokens per second for single concurrency, with scaling up to 273 tokens per second with concurrency of 8.
Production-ready deployment with NVIDIA NIM
Mistral Large 3 and Ministral-14B-Instruct are available for use through the NVIDIA API catalog and preview API for developers to get started with minimal setup. Soon enterprise developers can use the downloadable NVIDIA NIM microservices for easy deployment on any GPU-accelerated infrastructure.
Get started building with open source AI
The NVIDIA-accelerated Mistral 3 open model family represents a major leap for Transatlantic AI in the open source community. The flexibility of the models for large-scale MoE and edge-friendly dense transformers meet developers where they are and within their development lifecycle.
With NVIDIA-optimized performance, advanced quantization techniques like NVFP4, and broad framework support, developers can achieve exceptional efficiency and scalability from cloud to edge. To get started, download Mistral 3 models from Hugging Face or test deployment-free on build.nvidia.com/mistralai.
Tags
About the Authors
Anu Srivastava is a senior technical marketing manager who focuses on NVIDIA’s lighthouse AI model collaborations. She works with key partners and foundations to enable NVIDIA accelerated platform support for the open source developer ecosystem. Prior to NVIDIA, she worked at Google for over a decade in various engineering and management roles and holds a degree in computer science from the University of Texas at Austin.
Eduardo Alvarez is a senior technical lead at NVIDIA, where he focuses on AI inference at scale, performance optimization, workload economic analysis, and application enablement. He has a deep background in AI systems engineering, workload optimization, and accelerated computing—focused on translating innovations into real-world applications. Before NVIDIA, Eduardo held engineering roles at various semiconductor and energy tech companies.
