URL: https://www.hardware-corner.net/llama-cpp-update-qwen3-speed-boost/

llama.cpp Update Delivers Major Qwen3 Coder Next Token Speed Boost

Allan Witt Feb 15, 2026 at 5:18am PDT
[Image: screenshot from the llama.cpp PR showing the Qwen3 Next speed boost]

A recent pull request to llama.cpp delivers a measurable performance jump for the recently released Qwen3 Coder Next. Benchmarks posted with the PR show prompt processing roughly 10 to 28 percent faster and token generation roughly 30 to 40 percent faster, depending on backend and configuration. The largest gains are in token generation, which directly affects real-time coding and chat workflows.

The changes come from a compute graph rework that reduces unnecessary tensor copies and improves backend kernel behavior. CUDA and Metal both benefit, with additional improvements touching Vulkan and related GGML operations. For users running large quantized models locally, this translates into higher tokens per second without any hardware change.

What Changed in the PR

The pull request refactors the GGML compute graph to avoid redundant memory copies and to allow better kernel selection. On CUDA, it enables CUDA graphs for Qwen3 Next-style architectures and adjusts how fused operations are handled. On Metal, it introduces adaptive CPU/GPU interleave, improves concurrency, and consolidates several kernel paths.

These changes affect both full GPU offload setups and split configurations where part of the model resides in VRAM and the rest in system memory.
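One plausible reading of why this kind of change helps token generation more than prompt processing: fixed per-step overhead (kernel launches, extra copies) is paid once per generated token, but is spread across hundreds of tokens during a batched prompt pass. The toy cost model below sketches that effect. Every millisecond value in it is hypothetical and chosen only for illustration; none comes from the PR.

```python
# Toy cost model. All millisecond values are hypothetical and only
# illustrate the shape of the effect; they are not measurements.

def step_speedup(compute_ms: float,
                 overhead_before_ms: float,
                 overhead_after_ms: float) -> float:
    """Speedup of one graph evaluation when only the fixed overhead shrinks."""
    before = compute_ms + overhead_before_ms
    after = compute_ms + overhead_after_ms
    return before / after

# Decode: one token per step, so compute is small relative to overhead.
decode = step_speedup(compute_ms=8.0, overhead_before_ms=4.0, overhead_after_ms=1.0)

# Prefill: many tokens per step, so the same overhead is a small fraction.
prefill = step_speedup(compute_ms=200.0, overhead_before_ms=4.0, overhead_after_ms=1.0)

print(f"decode step speedup:  {decode:.2f}x")
print(f"prefill step speedup: {prefill:.2f}x")
```

The model captures only the overhead effect. The PR also changes kernels, which is one reason measured prompt processing gains can be larger than a model like this would suggest.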

CUDA Benchmarks: RTX 6000 Ada + RTX Pro 6000 Blackwell

Tests were run with an 80B Q8_0 build of Qwen3 Coder Next across two CUDA devices, an NVIDIA RTX 6000 Ada Generation and an NVIDIA RTX PRO 6000 Blackwell Workstation Edition.

Before the update, token generation ran in the high 80s (tokens per second) in dual GPU mode. After updating to the new build, token generation exceeded 118 tokens per second in the same configuration, and exceeded 130 tokens per second on the RTX PRO 6000 Blackwell alone.

Dual GPU, Q8_0, 80B

| Test | Old Build (t/s) | New Build (t/s) | Speedup |
|---|---|---|---|
| Prompt pp500 | 2470.78 | 2770.34 | 1.12x |
| Token tg32 | 87.35 | 118.63 | 1.36x |
| Token tg32 @ d500 | 85.99 | 119.69 | 1.39x |
| Token tg32 @ d1000 | 87.15 | 112.34 | 1.29x |

The prompt processing improvement is real, about 12 percent in this configuration. The token generation jump is much larger, roughly 29 to 39 percent depending on context depth.
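For readers who want to check the speedup column, the ratios follow directly from the dual GPU numbers reported above. A minimal sketch (the variable and function names are ours, not from the PR):

```python
# Dual GPU, Q8_0, 80B results as reported: (test, old t/s, new t/s).
DUAL_GPU_RESULTS = [
    ("pp500", 2470.78, 2770.34),
    ("tg32", 87.35, 118.63),
    ("tg32 @ d500", 85.99, 119.69),
    ("tg32 @ d1000", 87.15, 112.34),
]

def speedup(old_tps: float, new_tps: float) -> float:
    """Throughput ratio of the new build over the old build."""
    return new_tps / old_tps

for test, old_tps, new_tps in DUAL_GPU_RESULTS:
    ratio = speedup(old_tps, new_tps)
    gain_pct = (ratio - 1.0) * 100.0
    print(f"{test:<14} {ratio:.2f}x  (+{gain_pct:.0f}%)")
```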

Single RTX Pro 6000 Blackwell

| Test | New Build (t/s) |
|---|---|
| Prompt pp500 | 3563.60 |
| Token tg32 | 132.09 |

No old-build figure is listed for this single GPU configuration. Taking roughly 80 tokens per second as the baseline from earlier builds, reaching 130 tokens per second represents about a 60 percent increase in next token throughput. For users who invested heavily in high-end workstation GPUs, that is a meaningful reduction in waiting time during code generation.
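The percentage is simple arithmetic on the two rounded figures quoted in the text. A short sketch, using those rounded figures rather than exact measurements:

```python
def percent_increase(old_tps: float, new_tps: float) -> float:
    """Relative throughput gain, in percent."""
    return (new_tps - old_tps) / old_tps * 100.0

# Rounded figures from the text: roughly 80 t/s before, roughly 130 t/s after.
gain = percent_increase(80.0, 130.0)
print(f"{gain:.1f}% more tokens per second")
```

With the rounded inputs the result is 62.5 percent, which the article summarizes as "about 60 percent".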

DGX Spark and CUDA MXFP4 MoE

On a DGX Spark system using a GB10 GPU, the same PR shows gains with the MXFP4 MoE variant of Qwen3 Coder Next.

Before the update, tg32 measured around 34.9 tokens per second. After the update, it reached roughly 45.9 tokens per second.

| Test | Old (t/s) | New (t/s) | Speedup |
|---|---|---|---|
| Prompt pp500 | 1122.59 | 1242.33 | 1.11x |
| Token tg32 | 34.88 | 45.93 | 1.32x |

Again, prompt processing improved by around 10 percent, while token generation improved by over 30 percent.
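Tokens per second can be hard to feel. Converting the DGX Spark tg32 numbers to average milliseconds per generated token gives a more direct sense of the change. This is a minimal sketch that assumes throughput is steady across the run:

```python
def ms_per_token(tokens_per_second: float) -> float:
    """Average time spent on each generated token, in milliseconds."""
    return 1000.0 / tokens_per_second

# DGX Spark (GB10), MXFP4 MoE, tg32 as reported.
old_ms = ms_per_token(34.88)
new_ms = ms_per_token(45.93)

print(f"old: {old_ms:.2f} ms/token")
print(f"new: {new_ms:.2f} ms/token")
print(f"saved: {old_ms - new_ms:.2f} ms per token")
```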

Metal Backend: M2 Ultra Results

The PR also includes Metal benchmarks on an M2 Ultra system. For Q4_0 and Q8_0 variants of the 80B model, prompt processing and token generation both improved, typically in the 20 to 35 percent range.

For example, with Q8_0:

| Test | Old (t/s) | New (t/s) | Speedup |
|---|---|---|---|
| pp2048 | 1047.39 | 1338.82 | 1.28x |
| tg32 | 33.75 | 43.78 | 1.30x |

Metal users see proportional gains similar to those on CUDA, especially for token generation. The adaptive CPU/GPU interleave and improved concurrency appear to reduce overhead in longer sequences.
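To see how the two Metal numbers combine, here is a rough estimate of the time for one request shaped like the benchmark: a 2048-token prompt followed by 32 generated tokens. It is a sketch that assumes the two phases simply add up and ignores everything else (model load, sampling, scheduling):

```python
def request_seconds(prompt_tokens: int, gen_tokens: int,
                    pp_tps: float, tg_tps: float) -> float:
    """Estimated wall time: prompt pass plus token-by-token generation."""
    return prompt_tokens / pp_tps + gen_tokens / tg_tps

# M2 Ultra, Q8_0, as reported: pp2048 and tg32 throughput in tokens/s.
old_s = request_seconds(2048, 32, pp_tps=1047.39, tg_tps=33.75)
new_s = request_seconds(2048, 32, pp_tps=1338.82, tg_tps=43.78)

print(f"old build: {old_s:.2f} s")
print(f"new build: {new_s:.2f} s")
print(f"combined speedup: {old_s / new_s:.2f}x")
```

With such a short generation length, the prompt pass dominates the total. Longer completions shift the weight toward the token generation speedup.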

Full Offload and Split Model Impact

The improvements are visible in both full GPU offload and hybrid memory setups. The dual GPU test used ngl 99 (the number of layers to offload to the GPU), which in practice means full offload. The same speedups are reported when model weights are split between GPU VRAM and system memory.

For enthusiasts running 70B to 80B class models across multiple 24 GB or 48 GB cards, this matters. If you are memory constrained and forced into partial offload, you still benefit from the compute graph and backend changes.
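To reproduce numbers in the style of the tables above (pp500, tg32, and tg32 at depths 500 and 1000), a llama-bench invocation along the following lines should be close. This is a sketch: the model path is hypothetical, and the flags (`-m`, `-ngl`, `-p`, `-n`, `-d`) are as we understand the tool, so confirm them with `llama-bench --help` on your build.

```python
import shlex

def build_bench_command(model_path: str,
                        n_gpu_layers: int = 99,
                        prompt_tokens: int = 500,
                        gen_tokens: int = 32,
                        depths: tuple = (0, 500, 1000),
                        binary: str = "llama-bench") -> list:
    """Assemble an argv list for llama-bench; it does not run anything."""
    return [
        binary,
        "-m", model_path,
        "-ngl", str(n_gpu_layers),               # layers to offload to the GPU
        "-p", str(prompt_tokens),                # prompt processing test size
        "-n", str(gen_tokens),                   # token generation test size
        "-d", ",".join(str(d) for d in depths),  # context depths to test at
    ]

# Hypothetical model path, for illustration only.
cmd = build_bench_command("models/qwen3-coder-next-q8_0.gguf")
print(shlex.join(cmd))
```

Lowering `n_gpu_layers` below the model's layer count is one way to test a partial offload configuration with the same settings.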

What This Means for Performance per Dollar

If you were previously seeing 20 tokens per second on a mid-range CUDA card, a 30 percent uplift pushes that to about 26 tokens per second. On higher-end hardware, jumping from 80 to 130 tokens per second changes the usability of large coding models.
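The same projection can be applied to your own baseline. A small sketch, assuming the 30 percent uplift carries over to your hardware (which is not guaranteed) and using a hypothetical 500-token completion to show the wait time:

```python
def projected_tps(baseline_tps: float, uplift_fraction: float) -> float:
    """Throughput after applying a fractional uplift (0.30 means +30%)."""
    return baseline_tps * (1.0 + uplift_fraction)

def completion_seconds(tokens: int, tps: float) -> float:
    """Time to generate a completion of the given length."""
    return tokens / tps

baseline = 20.0                      # tokens/s, the article's example
after = projected_tps(baseline, 0.30)

# 500 tokens is a hypothetical completion length, not from the article.
print(f"projected: {after:.1f} t/s")
print(f"500 tokens before: {completion_seconds(500, baseline):.1f} s")
print(f"500 tokens after:  {completion_seconds(500, after):.1f} s")
```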

For coding workloads with Qwen3 Coder Next, token generation speed is often the bottleneck, not prompt ingestion. This PR shifts that balance in the right direction.

Users should rebuild from the latest commit and pin their CUDA toolkit and driver versions when benchmarking. Kernel selection and graph optimizations can behave differently depending on driver and architecture.
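One way to keep benchmark runs comparable is to record the llama.cpp commit, CUDA toolkit version, and GPU driver alongside each result. The helper below is a sketch, not part of llama.cpp. It shells out to `git`, `nvcc`, and `nvidia-smi` if they are on the PATH and reports "not found" otherwise.

```python
import shutil
import subprocess

# Label -> command. Run from inside the llama.cpp checkout for the git entry.
COMMANDS = {
    "llama_cpp_commit": ["git", "rev-parse", "HEAD"],
    "cuda_toolkit": ["nvcc", "--version"],
    "gpu_and_driver": ["nvidia-smi", "--query-gpu=name,driver_version",
                       "--format=csv,noheader"],
}

def capture_versions(cwd=None) -> dict:
    """Return a label -> text report; never raises if a tool is missing."""
    report = {}
    for label, argv in COMMANDS.items():
        if shutil.which(argv[0]) is None:
            report[label] = "not found"
            continue
        try:
            result = subprocess.run(argv, cwd=cwd, capture_output=True,
                                    text=True, timeout=10)
        except (OSError, subprocess.TimeoutExpired) as exc:
            report[label] = f"error: {exc}"
            continue
        output = result.stdout.strip() or result.stderr.strip()
        report[label] = output if result.returncode == 0 else f"failed: {output}"
    return report

if __name__ == "__main__":
    for label, value in capture_versions().items():
        print(f"{label}: {value}")
```

Saving this report next to each benchmark log makes it easier to tell whether a change in throughput came from the new build or from a driver or toolkit update.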

For anyone running 80B class quantized models locally, especially Q4_0, Q4_K_M, or Q8_0 variants, this update is worth testing. The gains are not marginal. In many cases, next token generation is over 30 percent faster, with prompt processing also improved but to a lesser degree.

In short, this is a backend optimization that translates directly into better real-world throughput on both CUDA and Metal, without requiring more VRAM or new hardware.
