
URL: https://www.hardware-corner.net/llamacpp-blackwell-seed-boost/



Huge Speed Boost for GPT-OSS Models on Blackwell GPUs with llama.cpp

Allan Witt Jan 7, 2026 at 4:48am PDT

Local LLM inference continues to move fast, and the latest llama.cpp updates are a good example of why running models on your own hardware keeps getting more attractive. Recent changes focused on NVIDIA Blackwell GPUs bring a clear improvement to both prompt processing and token generation, especially for GPT-OSS models. This article looks only at local inference and what these changes mean in practice for enthusiasts running quantized models at home or in small lab setups.

The short version is that llama.cpp now makes much better use of Blackwell GPU features, most notably FP4 acceleration in the tensor cores. The result is noticeably faster prompt processing, a long-standing pain point for large-context workloads, along with smaller and more situational gains in token generation.

What Changed in llama.cpp for Local Inference

The recent llama.cpp updates are not cosmetic. They target the actual inference path.

Prompt processing benefits most from native MXFP4 support on Blackwell. This allows parts of the prefill stage to run at much higher throughput by using FP4 tensor core instructions directly instead of falling back to less efficient paths. Token generation also improves, but the relative gains are usually smaller because decoding was already closer to being memory-bandwidth bound.
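For readers unfamiliar with the format: MXFP4, as defined in the OCP Microscaling specification, stores weights in blocks of 32 four-bit E2M1 values that share one 8-bit power-of-two scale. The sketch below decodes one such block in plain Python. It illustrates the number format only. It does not reproduce llama.cpp's packed memory layout or its CUDA kernels, and the function names are hypothetical.

```python
import math

# The eight non-negative magnitudes representable in E2M1
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32  # elements that share one scale in the MX formats


def decode_e2m1(code):
    """Decode a 4-bit E2M1 code (0..15) into a Python float."""
    sign = -1.0 if code & 0b1000 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0b0111]


def decode_mxfp4_block(scale_byte, codes):
    """Decode one block: an E8M0 scale byte plus 32 four-bit element codes."""
    if len(codes) != BLOCK_SIZE:
        raise ValueError(f"expected {BLOCK_SIZE} codes, got {len(codes)}")
    if scale_byte == 0xFF:  # E8M0 reserves 0xFF for NaN
        return [math.nan] * BLOCK_SIZE
    scale = 2.0 ** (scale_byte - 127)  # E8M0 is a biased power-of-two exponent
    return [scale * decode_e2m1(c) for c in codes]


if __name__ == "__main__":
    # Scale byte 128 means 2**1; the first element is the largest magnitude.
    block = decode_mxfp4_block(128, [0b0111, 0b1011] + [0] * 30)
    print(block[:2])
```

Each block takes 32 × 4 + 8 = 136 bits, or 4.25 bits per weight, which is why hardware that can multiply these values directly in the tensor cores avoids a dequantization step during prefill.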

Other changes that matter for local users include GPU-based token sampling, concurrent CUDA streams for QKV projections when CUDA graph optimization is enabled, and MMVQ kernel tuning that keeps the GPU busy instead of stalling on memory waits. Faster model loading is also part of the update, but that affects startup time rather than steady-state inference.

All tests discussed below were done with CUDA graph optimization enabled at runtime using the GGML_CUDA_GRAPH_OPT environment variable.
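A minimal sketch of how such a run might be launched from Python follows. The variable name GGML_CUDA_GRAPH_OPT comes from the text above. The value "1", the model path, and the helper name are assumptions made for illustration, and only the basic llama-bench flags for model, prompt length, and generation length are used. The exact benchmark settings behind the tables below are not stated in the article.

```python
import os
import shutil
import subprocess


def build_bench_invocation(model_path, prompt_tokens=4096, gen_tokens=128):
    """Return (command, environment) for a llama-bench run with graph opt enabled."""
    env = dict(os.environ)
    # Assumption: setting the variable to "1" turns the optimization on.
    env["GGML_CUDA_GRAPH_OPT"] = "1"
    cmd = [
        "llama-bench",
        "-m", model_path,          # path to the GGUF model file
        "-p", str(prompt_tokens),  # prompt-processing test length
        "-n", str(gen_tokens),     # token-generation test length
    ]
    return cmd, env


if __name__ == "__main__":
    # Hypothetical model path; replace with a real local file.
    cmd, env = build_bench_invocation("models/gpt-oss-20b-mxfp4.gguf")
    if shutil.which(cmd[0]) and os.path.exists(cmd[2]):
        subprocess.run(cmd, env=env, check=True)
    else:
        print("Would run:", " ".join(cmd))
```

Passing the variable through a copied environment keeps the setting scoped to the benchmark process rather than the whole shell session.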

Test Hardware and Models

Two Blackwell GPUs were tested to reflect realistic local builds at very different budget levels.

The RTX 5060 Ti 16 GB was used with GPT-OSS 20B in MXFP4 quantization. This card is currently one of the better performance-per-dollar options for mid-size local models that fit comfortably in 16 GB of VRAM.

The RTX Pro 6000 Blackwell 96 GB Workstation Edition was used with GPT-OSS 120B in MXFP4 quantization. This is a high-end setup, but it represents what a single-card workstation can do today without going multi-GPU.

GPT-OSS 20B on RTX 5060 Ti 16 GB

For the 20B model, the biggest gains at small and medium context sizes show up in prompt processing. Token generation improves only modestly there, but its gain grows with context and overtakes the prompt processing gain at around 120k tokens.

| Context Size | PP Old (t/s) | PP New (t/s) | PP Change | TG Old (t/s) | TG New (t/s) | TG Change |
|---|---|---|---|---|---|---|
| 4k | 3585 | 4534 | +26.5% | 92.1 | 94.0 | +2.1% |
| 32k | 1738 | 1963 | +13.0% | 73.2 | 80.0 | +9.3% |
| ~120k | 685 | 729 | +6.4% | 43.8 | 56.1 | +28.2% |

*PP – prompt processing; TG – token generation
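The change columns are plain relative differences between the old and new throughput. As a quick check, the sketch below recomputes them from the rounded figures in the 20B table. Recomputing from rounded values can differ from the published column by 0.1 percentage points in places, presumably because the published figures were derived from unrounded measurements. That explanation is an assumption, not something the article states.

```python
def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# (context, PP old, PP new, TG old, TG new), copied from the 20B table.
ROWS_20B = [
    ("4k", 3585, 4534, 92.1, 94.0),
    ("32k", 1738, 1963, 73.2, 80.0),
    ("~120k", 685, 729, 43.8, 56.1),
]

for ctx, pp_old, pp_new, tg_old, tg_new in ROWS_20B:
    pp = percent_change(pp_old, pp_new)
    tg = percent_change(tg_old, tg_new)
    print(f"{ctx:>6}: PP {pp:+.1f}%  TG {tg:+.1f}%")
```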

At low context, the update mostly helps prompt ingestion. At very large context sizes, token generation benefits more, which helps long interactive sessions where generation dominates total runtime. For a 16 GB card, this is a solid uplift without changing quantization or model choice.

GPT-OSS 120B on RTX Pro 6000 Blackwell

On the RTX Pro 6000 Blackwell, GPT-OSS 120B shows a clear improvement with the latest llama.cpp updates. The gains are concentrated almost entirely in prompt processing, while token generation stays essentially the same.

At smaller context sizes (4k to 16k), prompt processing throughput increases by roughly 30 to 37 percent, which noticeably reduces time to first token. As the context grows, the uplift gradually tapers off, landing at a high single-digit improvement at 131k. Even so, prompt ingestion is faster at every context size tested.

| Context Size | PP Old (t/s) | PP New (t/s) | PP Change | TG Old (t/s) | TG New (t/s) | TG Change |
|---|---|---|---|---|---|---|
| 4k | 4758 | 6495 | +36.5% | 211.2 | 210.7 | −0.2% |
| 8k | 4448 | 5912 | +32.9% | 183.2 | 182.8 | −0.2% |
| 16k | 3955 | 5147 | +30.1% | 176.9 | 176.5 | −0.2% |
| 32k | 3148 | 3699 | +17.5% | 163.7 | 163.4 | −0.2% |
| 64k | 2022 | 2297 | +13.6% | 142.7 | 142.7 | +0.0% |
| 131k | 1094 | 1177 | +7.6% | 113.7 | 113.7 | +0.0% |

*PP – prompt processing; TG – token generation

Token generation does not meaningfully change. Decode speed is effectively flat before and after the update.

Overall, this update makes running GPT-OSS 120B locally feel more responsive, especially for workflows that rely on large prompts or long conversation histories. The benefit is a shorter time to first token, not higher generation throughput, and waiting on prompt ingestion is exactly where large models tend to feel slow in day-to-day use.
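To put the latency effect in concrete terms, a rough estimate of prefill time is prompt length divided by prompt processing throughput. The sketch below applies that to the 32k row of the 120B table. It assumes that "32k" means 32,768 tokens, that the reported rate holds as an average over a prompt of that length, and that nothing else adds overhead. The 500-token reply is a made-up example, so treat the output as an approximation rather than a measurement.

```python
def prefill_seconds(prompt_tokens, pp_tokens_per_s):
    """Approximate time to ingest the prompt (a proxy for time to first token)."""
    return prompt_tokens / pp_tokens_per_s


def request_seconds(prompt_tokens, output_tokens, pp_tokens_per_s, tg_tokens_per_s):
    """Approximate end-to-end time: prefill plus decode."""
    return prompt_tokens / pp_tokens_per_s + output_tokens / tg_tokens_per_s


PROMPT_TOKENS = 32_768  # assumption: "32k" taken as 32 * 1024
OUTPUT_TOKENS = 500     # hypothetical reply length

# Rates from the 32k row of the 120B table.
old = request_seconds(PROMPT_TOKENS, OUTPUT_TOKENS, 3148, 163.7)
new = request_seconds(PROMPT_TOKENS, OUTPUT_TOKENS, 3699, 163.4)

print(f"prefill old: {prefill_seconds(PROMPT_TOKENS, 3148):.2f} s")
print(f"prefill new: {prefill_seconds(PROMPT_TOKENS, 3699):.2f} s")
print(f"request old: {old:.2f} s, new: {new:.2f} s")
```

Under these assumptions nearly all of the saving comes from the prefill term, since the decode term barely changes between the two builds.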

Why Blackwell Matters for Local LLMs

The key takeaway is that these gains are not generic CUDA speedups. They come from using hardware features that only exist on Blackwell GPUs. Older NVIDIA architectures have no FP4 tensor cores to run these paths on, and neither does most non-NVIDIA hardware.

Some optimizations, like better kernel fusion and sampling changes, can benefit other platforms indirectly. AMD users have already seen gradual improvements through ROCm builds, but architecture specific features will always favor the hardware they are designed for.

For local users, this means Blackwell shifts the performance-per-dollar curve again. A midrange Blackwell card can now handle prompt-heavy workloads more comfortably than previous generations, and high-end cards push large quantized models closer to practical interactive use.

Practical Advice for Local Builders

If you are already running llama.cpp on a Blackwell GPU, updating is an easy win. The gains are real, especially for GPT-OSS models.

If you are planning a new build, these results strengthen the case for a single strong Blackwell GPU over setups built around older-generation GPUs.

The steady progress in llama.cpp over the past year shows that local inference is not slowing down. With better use of modern GPU features, the gap between consumer hardware and datacenter-level performance continues to shrink, one update at a time.
