NVFP4 Trains with Precision of 16-Bit and Speed and Efficiency of 4-Bit

Aug 25, 2025

By Kirthi Devleker and Farshad Ghodsian

AI-Generated Summary

NVIDIA's NVFP4 format enables 4-bit pretraining, allowing AI factories to scale more rapidly and sustainably by cutting memory needs, boosting arithmetic throughput, and optimizing communication.
The NVFP4 pretraining recipe leverages techniques like micro-block scaling, high-precision block encoding, and stochastic rounding to maintain model accuracy and stability during large-scale training.
Experiments with a 12-billion parameter model showed that NVFP4 achieved accuracy comparable to higher precision formats like FP8, demonstrating its potential for efficient large-scale frontier model training.

In recent years, AI workloads have grown exponentially—not only in the deployment of large language models (LLMs) but also in the demand to process ever more tokens during pretraining and post-training. As organizations scale up compute infrastructure to train and deploy multi-billion-parameter foundation models, the ability to sustain higher token throughput has become mission critical. Progress is increasingly defined not just by efficiency, but by how many tokens an AI factory can push through to unlock the next wave of model capabilities.

AI-optimized data formats have emerged as a key innovation in this effort. Narrow-precision computation has already transformed inference with NVIDIA’s introduction of NVFP4, a 4-bit format purpose-built to deliver exceptional inference latency, throughput, and efficiency while maintaining production-grade accuracy.

Now, NVIDIA is extending this innovation to the pretraining phase, marking a major leap forward in LLM development. Using NVFP4 for pretraining unlocks huge improvements in training LLMs at scale and overall infrastructure efficiency. This isn’t just an incremental optimization—it’s a foundational shift in how large models can be trained at scale.

In the era of AI factories, where compute is the engine of progress, precision is no longer a backend detail—it’s a strategic advantage. NVFP4 4-bit pretraining redefines the boundaries of efficiency and scalability, setting a new standard for high-performance AI model development.

NVFP4 training is still in the research phase, exploring and validating the potential of 4-bit precision in large-scale model pretraining. Active engagements and continued collaboration around NVFP4 are ongoing with leading organizations such as Amazon Web Services, Cohere, Google Cloud, Kimi AI, Microsoft AI, Mistral, OpenAI, Perplexity, Reflection, and Runway.

What is 4-bit quantization?

4-bit quantization refers to the process of reducing the precision of model weights and activations to just 4 bits—a dramatic drop from the typical 16-bit or 32-bit floating-point formats.

Pretraining with 4 bits is challenging because gradients and updates must be handled very carefully to preserve accuracy while improving the overall training speed. Specialized techniques and recipes are required to maintain effectiveness while mapping high-precision tensors to a much smaller set of quantized values.
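
To make the "much smaller set of quantized values" concrete, the sketch below decodes all sixteen codes of a 4-bit E2M1 float (1 sign bit, 2 exponent bits, 1 mantissa bit). That NVFP4 elements use the E2M1 layout is an assumption taken from the format's public description rather than from this post, and `decode_e2m1` is a hypothetical helper name.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code: 1 sign bit, 2 exponent bits, 1 mantissa bit."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 1
    if exp == 0:
        # Subnormal range: no implicit leading 1, so only 0 and 0.5.
        return sign * man * 0.5
    # Normal range: implicit leading 1 and an exponent bias of 1.
    return sign * (1.0 + 0.5 * man) * 2.0 ** (exp - 1)


# Collect every distinct value the 16 codes can represent.
values = sorted({decode_e2m1(code) for code in range(16)})
magnitudes = [v for v in values if v > 0]
print(len(values), magnitudes)
```

Under these assumptions only 15 distinct values exist (positive and negative zero coincide) and the largest magnitude is 6, which is why a high-precision tensor has to be rescaled carefully before it is mapped onto this grid.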

How fewer bits unlock more capability for AI factories

In recent years, AI workloads have grown exponentially—not just in the deployment of large language models (LLMs) but also in the scale of foundation model pretraining and post-training. As organizations expand compute infrastructure to handle training and deployment of multi-billion-parameter models, progress is increasingly defined by how much token throughput an AI factory can sustain to unlock new capabilities.

Inference has already undergone multiple waves of innovation, from FP32 and FP16 down to FP8 and most recently, NVIDIA’s release of NVFP4 for AI inference. While methods like post-training quantization (PTQ) have shown NVFP4 to be a force multiplier in increasing inference throughput while maintaining accuracy, a remaining challenge lies upstream in pretraining—where foundation models still rely on BF16 or FP8 for stability and convergence.

Training is where AI factories can spend the bulk of their compute, power, and time. Power budgets are fixed and GPU cycles are scarce, so developers must account for every bit, token, and epoch. Throughput isn’t an abstract metric here—it directly determines what scale of models can be built, how many experiments can be run, and how quickly breakthroughs arrive.

This is where 4-bit precision becomes transformative. By cutting memory needs, boosting arithmetic throughput, and optimizing communication, 4-bit pretraining allows factories to push significantly more tokens through the same hardware. With the right quantization recipe, it can deliver accuracy on par with FP8/BF16 while dramatically raising throughput—unlocking faster convergence cycles, more experiments per unit of compute, and scaling to unprecedented frontier models. In other words, fewer bits don’t just save money—they expand the frontier of what AI factories can achieve.

The NVFP4 quantization recipe for pretraining

To enable pretraining at 4-bit precision, we’ve developed a purpose-built NVFP4 pretraining recipe that addresses the core challenges of dynamic range, gradient volatility, and numerical stability in large-scale training.

Blackwell was the first architecture from NVIDIA to natively support FP4 formats. The massive FP4 FLOPs throughput on GB200 and GB300 enables efficient 4-bit training by accelerating narrow-precision matrix operations while maintaining the scale and parallelism needed for large model convergence—making them ideal for next-generation AI factories deploying FP4-based pretraining.

Figure 1 shows measured GEMM performance on Blackwell Ultra, revealing a 7x speedup over the Hopper generation. Modern LLMs rely on matrix multiplication as a core computational element, particularly within their fully connected (linear) layers, so the efficiency of these operations is crucial. Because FP4 precision executes these operations faster and more efficiently, the observed GEMM acceleration means the entire pretraining process, from forward propagation to gradient updates, runs significantly faster, reducing time-to-train and speeding up the development of larger-scale models.

To enable efficient narrow-precision training, NVIDIA’s NVFP4 pretraining recipe leverages several key techniques which have been chosen based on their performance and accuracy. These include:

Enhanced value representation with NVFP4’s micro-block scaling: Blackwell introduces native Tensor Core support for NVFP4, a 4-bit numerical format for both weights and activations that uses micro-block scaling, in which each group of sixteen 4-bit elements shares a common scaling factor. By halving the block size relative to MXFP4, from 32 to 16 elements, NVFP4 confines the influence of any single outlier to fewer values and enables more precise scaling. This finer granularity reduces quantization error and improves overall model accuracy.

NVFP4 high-precision block encoding with E4M3 scale factors: Scale factor precision plays a critical role in quantization quality and accuracy. Unlike MXFP4, which is limited to power-of-two scale factors (E8M0) and prone to high rounding errors, NVFP4 uses higher-precision E4M3 scale factors with additional mantissa bits. This allows finer-grain scaling, better utilization of the limited quantization bins, and more accurate representation of values within a block.

Reshaping tensor distributions to fit narrow formats: Gradients and activations during LLM pretraining tend to have large outliers that can impact narrow-precision quantization. Applying Hadamard transforms to GEMM inputs helps reshape their distribution to be more Gaussian-like, which smooths outliers and makes tensors easier to represent accurately. These transformations are transparent to the model architecture and can be applied to linear layers in the forward and backward pass.

Maintaining fidelity with quantization techniques: To ensure stable and efficient training, we employ quantization methods that preserve consistency between the forward and backward passes. Techniques such as selective 2D block-based quantization help maintain alignment in tensor representations throughout the training cycle. This consistency is key to minimizing signal distortion, improving convergence behavior, and enhancing overall robustness—especially when operating under narrow-precision formats like NVFP4.

Reducing bias with stochastic rounding: Unlike traditional (deterministic) rounding, where gradients are always rounded to the nearest representable number, stochastic rounding rounds gradients up or down at random, with the probability of landing on each neighboring representable value proportional to how close the original number lies to it. This step is essential for reducing rounding bias, maintaining gradient flow during training, and ultimately improving model accuracy.
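
Two of the techniques above, micro-block scaling and stochastic rounding, can be illustrated with a short NumPy sketch. It is a simulation under stated assumptions, not NVIDIA’s implementation: values are "fake-quantized" (rounded onto the 4-bit grid, then rescaled back to float), the hard-coded grid assumes E2M1 elements, block scales stay in floating point rather than being rounded to E4M3, ties in nearest rounding go down, and the Hadamard transform and 2D block quantization are left out. The function names `round_to_grid` and `quantize_blockwise` are hypothetical.

```python
import numpy as np

# Magnitudes assumed representable by a 4-bit E2M1 element (sign kept separately).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def round_to_grid(v, stochastic=False, rng=None):
    """Round values onto +/- E2M1_GRID, clipping magnitudes above 6."""
    v = np.asarray(v, dtype=np.float64)
    sign = np.sign(v)
    mag = np.clip(np.abs(v), 0.0, E2M1_GRID[-1])
    # Index of the grid point at or below each magnitude (capped so hi exists).
    lo_idx = np.searchsorted(E2M1_GRID, mag, side="right") - 1
    lo_idx = np.clip(lo_idx, 0, len(E2M1_GRID) - 2)
    lo, hi = E2M1_GRID[lo_idx], E2M1_GRID[lo_idx + 1]
    frac = (mag - lo) / (hi - lo)  # 0 at lo, 1 at hi
    if stochastic:
        rng = np.random.default_rng() if rng is None else rng
        go_up = rng.random(mag.shape) < frac  # P(round up) = frac
    else:
        go_up = frac > 0.5  # nearest neighbor; ties go down (simplification)
    return sign * np.where(go_up, hi, lo)


def quantize_blockwise(x, block_size=16, stochastic=False, rng=None):
    """Fake-quantize a 1-D array with one shared scale per block of elements."""
    blocks = np.asarray(x, dtype=np.float64).reshape(-1, block_size)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    # Map each block's largest magnitude to 6; all-zero blocks get scale 1.
    scale = np.where(amax > 0, amax / E2M1_GRID[-1], 1.0)
    q = round_to_grid(blocks / scale, stochastic, rng)
    return (q * scale).reshape(np.shape(x))


# Block size: one outlier among small values.
x = np.full(32, 0.1)
x[0] = 100.0
for size in (32, 16):
    mse = float(np.mean((quantize_blockwise(x, size) - x) ** 2))
    print(f"block_size={size}: mse={mse:.6f}")

# Rounding mode: the same value quantized many times.
g = np.full(200_000, 0.7)
print("nearest mean:   ", round_to_grid(g).mean())
print("stochastic mean:", round_to_grid(g, True, np.random.default_rng(0)).mean())
```

In this toy input, a single 32-element block lets the outlier flush the other 31 values to zero, while 16-element blocks lose only the 15 values that share a block with it. Likewise, nearest rounding maps every 0.7 to 0.5, a systematic error, whereas the stochastic version averages close to 0.7 over many samples.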

NVFP4 makes 4-bit pretraining real: Accuracy and stability at trillion-token scale

For narrow-precision formats to be practical in large-scale pretraining, they must ensure both model accuracy and stable convergence. To assess the viability of 4-bit precision in large-scale model training, experiments were conducted with FP8 and NVFP4 on a 12-billion-parameter model based on a combined Mamba-Transformer architecture (the 12B Hybrid Mamba-Transformer model), similar to NVIDIA Nemotron Nano 2. This model was trained on a massive dataset of 10 trillion tokens using a phased data-blending approach, switching to a different dataset mix for the second phase at 70% of training and again for the third phase at 90%.

A version of the 12B Hybrid Mamba-Transformer model was first trained in 8-bit precision (FP8), which previous studies have shown to closely match 16-bit precision and which therefore served as our baseline for comparison. We then successfully trained this same 12B model from scratch using NVFP4, demonstrating that this new low-precision format can support full pretraining at trillion-token scale. The NVFP4 run exhibited stable convergence without the training instabilities or divergence issues that typically plague ultra-low precision training.

Figure 3 below shows that NVFP4’s validation loss curve closely matches the loss curves from the higher-precision baseline (i.e., FP8) throughout the entire duration of training. The quantization techniques outlined above ensure that even with aggressive bit-width reduction, the 4-bit pretraining dynamics closely resemble those of higher-precision runs.

We then took the 12B Hybrid Mamba-Transformer model pretrained using NVFP4 and compared it to the higher-precision FP8 baseline across a range of downstream tasks and intelligence domains. Figure 4 illustrates that NVFP4 matches the performance of FP8 across all domains. This finding strengthens the initial hypothesis: NVFP4 is a robust choice for pretraining LLMs even at the trillion-token scale, with clear potential for efficient large-scale frontier model training.

Train smarter, not just harder

NVIDIA’s NVFP4 format is redefining the landscape of AI training—setting a new benchmark for speed, efficiency, and purposeful innovation. By enabling 4-bit pretraining, NVFP4 empowers AI factories to scale more rapidly and sustainably, paving the way for the next era of generative AI. As a dynamic and evolving technology, NVFP4 continues to unlock new opportunities for teams building frontier models, driving progress in energy-efficient, high-performance AI. With its breakthrough in compute efficiency, 4-bit pretraining opens the door to more advanced architectures, larger training runs, and significantly more tokens—fueling the future of intelligent systems.

About the Authors

About Kirthi Devleker
Kirthi K. Devleker is a technology marketing leader at NVIDIA, where he drives the launch and positioning of transformative AI platforms and the GPU architectures that power them. He played a pivotal role in bringing NVIDIA’s groundbreaking Grace Blackwell architecture to market, including the Grace Blackwell and Grace Blackwell Ultra platforms—redefining performance, scalability, and efficiency for generative AI at global scale. Kirthi specializes in crafting compelling messages around NVIDIA’s datacenter GPU technologies, highlighting their performance advantages and ROI for enterprise AI adoption. Previously, at MathWorks, he led the global Medical Devices business unit and spearheaded strategic product management initiatives that guided the Signal Processing group’s roadmap towards AI. His leadership accelerated machine learning integration across medical devices, aerospace and defense and automotive sectors. As a recognized industry voice, Kirthi has delivered keynotes and technical talks at international conferences on AI-driven engineering and simulation. He holds a Master of Science in Electrical Engineering from San Jose State University, with a specialization in signal and image processing.

About Farshad Ghodsian
Farshad Ghodsian is a senior technical marketing engineer at NVIDIA, where he focuses on AI training and inference at scale, performance optimization insights, new model releases, and AI engineering enablement. He brings a wealth of experience at the intersection of AI infrastructure, distributed training, GPU-accelerated computing and cloud-native MLOps—translating cutting-edge research into practical insights for developers, enterprise teams and business leaders. Prior to NVIDIA, Farshad held technical roles at leading semiconductor and consulting companies, where he helped build and manage large-scale generative AI and MLOps platforms for top technology customers.

URL: https://developer.nvidia.com/blog/nvfp4-trains-with-precision-of-16-bit-and-speed-and-efficiency-of-4-bit/