llama.cpp Update Delivers Major Qwen3 Coder Next Token Speed Boost
A recent pull request to llama.cpp is delivering a measurable performance jump for the recently released Qwen3 Coder Next, with tests showing significant increases in both prompt processing and next token generation speeds. The largest gains are in token generation, which directly affects real-time coding and chat workflows.
The changes come from a compute graph rework that reduces unnecessary tensor copies and improves backend kernel behavior. CUDA and Metal both benefit, with additional improvements touching Vulkan and related GGML operations. For users running large quantized models locally, this translates into higher tokens per second without any hardware change.
What Changed in the PR
The pull request refactors the GGML compute graph to avoid redundant memory copies and to allow better kernel selection. On CUDA, it enables CUDA graphs for Qwen3 Next-style architectures and adjusts how fused operations are handled. On Metal, it introduces adaptive CPU/GPU interleave, improves concurrency, and consolidates several kernel paths.
These changes affect both full GPU offload setups and split configurations where part of the model resides in VRAM and the rest in system memory.
CUDA Benchmarks: RTX 6000 Ada + RTX Pro 6000 Blackwell
Tests were run with an 80B Q8_0 build of Qwen3 Coder Next across two CUDA devices, an NVIDIA RTX 6000 Ada Generation and an NVIDIA RTX PRO 6000 Blackwell Workstation Edition.
Before the update, token generation ran in the high 80s (tokens per second) in dual GPU mode. After updating to the new build, token generation exceeded 118 tokens per second in the same configuration, and topped 130 tokens per second on the RTX PRO 6000 Blackwell alone.
Dual GPU, Q8_0, 80B
| Test | Old Build (t/s) | New Build (t/s) | Speedup |
|---|---|---|---|
| Prompt pp500 | 2470.78 | 2770.34 | 1.12x |
| Token tg32 | 87.35 | 118.63 | 1.36x |
| Token tg32 @ d500 | 85.99 | 119.69 | 1.39x |
| Token tg32 @ d1000 | 87.15 | 112.34 | 1.29x |
The prompt processing improvement is real, about 12 percent in this configuration. The token generation jump is much larger, roughly 29 to 39 percent depending on context depth.
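The Speedup column in the dual GPU table is simply the new throughput divided by the old. As a rough sketch for readers who want to check the numbers or apply the same arithmetic to their own runs, the following recomputes it from the table values. The function and variable names are illustrative, not part of llama.cpp.

```python
# Old/new throughput in tokens per second, copied from the dual GPU table.
RESULTS = {
    "pp500": (2470.78, 2770.34),
    "tg32": (87.35, 118.63),
    "tg32 @ d500": (85.99, 119.69),
    "tg32 @ d1000": (87.15, 112.34),
}


def speedup(old_tps: float, new_tps: float) -> float:
    """Return the throughput ratio new/old (1.0 means no change)."""
    if old_tps <= 0:
        raise ValueError("old throughput must be positive")
    return new_tps / old_tps


def percent_gain(old_tps: float, new_tps: float) -> float:
    """Return the relative improvement as a percentage."""
    return (speedup(old_tps, new_tps) - 1.0) * 100.0


if __name__ == "__main__":
    for name, (old, new) in RESULTS.items():
        ratio = speedup(old, new)
        gain = percent_gain(old, new)
        print(f"{name:14s} {ratio:.2f}x  (+{gain:.1f}%)")
```

Rounded to two decimals, the ratios match the Speedup column.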
Single RTX Pro 6000 Blackwell
| Test | New Build (t/s) |
|---|---|
| Prompt pp500 | 3563.60 |
| Token tg32 | 132.09 |
No old-build run is listed for this single GPU configuration. Against a baseline of roughly 80 tokens per second on earlier builds, reaching about 130 tokens per second is an increase of around 60 percent in next token throughput. For users who invested heavily in high-end workstation GPUs, that is a meaningful reduction in waiting time during code generation.
DGX Spark and CUDA MXFP4 MoE
On a DGX Spark system using a GB10 GPU, the same PR shows gains with the MXFP4 MoE variant of Qwen3 Coder Next.
Before the update, tg32 measured around 34.9 tokens per second. After the update, it reached roughly 45.9 tokens per second.
| Test | Old (t/s) | New (t/s) | Speedup |
|---|---|---|---|
| Prompt pp500 | 1122.59 | 1242.33 | 1.11x |
| Token tg32 | 34.88 | 45.93 | 1.32x |
Again, prompt processing improved by around 10 percent, while token generation improved by over 30 percent.
Metal Backend: M2 Ultra Results
The PR also includes Metal benchmarks on an M2 Ultra system. For Q4_0 and Q8_0 variants of the 80B model, prompt processing and token generation both improved, typically in the 20 to 35 percent range.
For example, with Q8_0:
| Test | Old (t/s) | New (t/s) | Speedup |
|---|---|---|---|
| pp2048 | 1047.39 | 1338.82 | 1.28x |
| tg32 | 33.75 | 43.78 | 1.30x |
Metal users see proportional gains similar to those on CUDA, especially for token generation. The adaptive CPU/GPU interleave and improved concurrency appear to reduce overhead in longer sequences.
Full Offload and Split Model Impact
The improvements are visible in both full GPU offload and hybrid memory setups. The dual GPU test used ngl 99, which asks llama.cpp to offload up to 99 layers to the GPU; for a model with fewer layers than that, this amounts to full offload. The same speedups are reported when splitting model weights between GPU VRAM and system memory.
For enthusiasts running 70B to 80B class models across multiple 24 GB or 48 GB cards, this matters. If you are memory constrained and forced into partial offload, you still benefit from the compute graph and backend changes.
What This Means for Performance per Dollar
If you were previously seeing 20 tokens per second on a mid-range CUDA card, a 30 percent uplift pushes that to about 26 tokens per second. On higher-end hardware, jumping from 80 to 130 tokens per second changes the usability of large coding models.
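To make the uplift concrete, it helps to convert throughput into waiting time. The sketch below projects a new tokens per second figure from a baseline and a percentage uplift, then estimates how long a completion takes at a given rate. The 500-token completion length is an assumption chosen for illustration, not a figure from the benchmarks, and the estimate ignores prompt processing time.

```python
def projected_tps(baseline_tps: float, uplift_percent: float) -> float:
    """Apply a percentage uplift to a baseline throughput."""
    return baseline_tps * (1.0 + uplift_percent / 100.0)


def seconds_for(tokens: int, tps: float) -> float:
    """Time to generate `tokens` at a steady rate of `tps` tokens per second."""
    if tps <= 0:
        raise ValueError("throughput must be positive")
    return tokens / tps


if __name__ == "__main__":
    # Mid-range example from the text: 20 t/s with a 30 percent uplift.
    print(f"projected: {projected_tps(20.0, 30.0):.1f} t/s")

    # Assumed 500-token completion at the two high-end rates mentioned.
    completion_tokens = 500  # illustrative value, not from the benchmarks
    for tps in (80.0, 130.0):
        secs = seconds_for(completion_tokens, tps)
        print(f"{tps:.0f} t/s -> {secs:.2f} s")
```

Under that assumption, a 500-token completion drops from 6.25 seconds at 80 tokens per second to just under 4 seconds at 130.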
For coding workloads with Qwen3 Coder Next, token generation speed is often the bottleneck, not prompt ingestion. This PR shifts that balance in the right direction.
Users should rebuild from the latest commit and pin their CUDA toolkit and driver versions when benchmarking. Kernel selection and graph optimizations can behave differently depending on driver and architecture.
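For repeatable before and after comparisons, it can help to build the benchmark command in one place so both builds run with identical settings. The following is a minimal sketch that assembles a llama-bench invocation mirroring the pp500, tg32, and depth 0/500/1000 tests above. The flag names reflect llama-bench as I understand it and should be checked against `llama-bench --help` for your build. The model path is hypothetical.

```python
import shlex


def build_llama_bench_cmd(
    model_path: str,
    n_prompt: int = 500,
    n_gen: int = 32,
    depths: tuple = (0, 500, 1000),
    n_gpu_layers: int = 99,
    binary: str = "llama-bench",
) -> list:
    """Assemble an argument list for llama-bench.

    Flags are assumed from llama-bench's usual options; verify them
    with `llama-bench --help` before relying on this.
    """
    return [
        binary,
        "-m", model_path,                          # GGUF model file
        "-p", str(n_prompt),                       # prompt tokens (pp500)
        "-n", str(n_gen),                          # generated tokens (tg32)
        "-d", ",".join(str(d) for d in depths),    # context depths
        "-ngl", str(n_gpu_layers),                 # layers to offload
        "-o", "json",                              # machine-readable output
    ]


if __name__ == "__main__":
    # Hypothetical model path; substitute your own GGUF file.
    cmd = build_llama_bench_cmd("models/qwen3-coder-next-q8_0.gguf")
    print(shlex.join(cmd))
```

Running the printed command once against the old build and once against the new one, with the same driver and toolkit versions, keeps the comparison limited to the llama.cpp change itself.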
For anyone running 80B class quantized models locally, especially Q4_0, Q4_K_M, or Q8_0 variants, this update is worth testing. The gains are not marginal. In many cases, next token generation is over 30 percent faster, with prompt processing also improved but to a lesser degree.
In short, this is a backend optimization that directly translates into better real world throughput on both CUDA and Metal, without requiring more VRAM or new hardware.
