I Tested llama.cpp’s New Speed Boost Mode with an RTX 3090 – Here’s What I Found
The development pace in the local LLM scene is relentless, and the team behind llama.cpp has rolled out another interesting update: a new high-throughput mode. The key claim is that by changing how the KV cache is handled for multiple, parallel requests, we can see significant performance gains. As a hands-on enthusiast, I wanted to cut through the hype and see what this means for a common, value-oriented hardware setup like my own.
The core idea behind this update is a shift from a “unified” to a “split” KV cache. Previously, when running multiple requests (or “sequences”) at once, the GPU had to perform a lot of wasted calculations figuring out how these independent conversations related to each other, a process called “cross-sequence attention.” The new mode gives each sequence its own dedicated KV cache, eliminating that wasted effort. This is supposed to deliver faster speeds and use less RAM, especially in multi-user scenarios.
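To get a feel for why this matters more as the number of parallel sequences grows, here is a deliberately crude back-of-the-envelope model. It is my own illustration and not how llama.cpp schedules its kernels. It assumes every sequence has the same length, ignores causal masking, and simply counts query/key pairs that get scored. The function names are made up for this sketch.

```python
def scored_pairs_unified(n_seq: int, seq_len: int) -> int:
    """Toy model: every token is scored against every cell of one shared cache."""
    total_tokens = n_seq * seq_len
    return total_tokens * total_tokens


def scored_pairs_split(n_seq: int, seq_len: int) -> int:
    """Toy model: each token is scored only against its own sequence's cache."""
    return n_seq * seq_len * seq_len


def wasted_fraction(n_seq: int, seq_len: int) -> float:
    """Share of the unified-cache work spent on pairs from unrelated sequences."""
    unified = scored_pairs_unified(n_seq, seq_len)
    split = scored_pairs_split(n_seq, seq_len)
    return (unified - split) / unified


if __name__ == "__main__":
    for n_seq in (1, 2, 8, 32):
        print(n_seq, f"{wasted_fraction(n_seq, 2048):.3f}")
```

In this toy model the wasted share is 1 - 1/n_seq, so a single sequence wastes nothing, which is consistent with the expectation that single-user workloads gain little.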
To see if this holds up, I ran a series of tests on my rig: an AMD EPYC 9334 CPU with 48 GB of RAM and a trusty 24GB NVIDIA RTX 3090, running on Ubuntu. To enable the new feature, you currently need to set the LLAMA_SET_ROWS=1 environment variable, though this is expected to become the default in the future.
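For anyone who wants to script the toggle, a minimal sketch of launching llama-server with the variable set might look like the following. The model path, context size, GPU layer count and slot count are placeholders, not my exact settings, and the helper name is hypothetical. The only thing taken from the text above is LLAMA_SET_ROWS=1.

```python
import os


def server_launch(model_path: str, n_parallel: int = 4, ctx_size: int = 16384,
                  high_throughput: bool = True):
    """Build the command line and environment for a llama-server run.

    All values here are illustrative placeholders; adjust them for your setup.
    """
    cmd = [
        "llama-server",
        "--model", model_path,
        "--ctx-size", str(ctx_size),
        "--n-gpu-layers", "99",          # offload as many layers as possible
        "--parallel", str(n_parallel),   # number of server slots
    ]
    env = dict(os.environ)
    # Start from a clean state so "off" really means the variable is unset.
    env.pop("LLAMA_SET_ROWS", None)
    if high_throughput:
        env["LLAMA_SET_ROWS"] = "1"
    return cmd, env


if __name__ == "__main__":
    cmd, env = server_launch("model.gguf")
    print(" ".join(cmd), "| LLAMA_SET_ROWS =", env.get("LLAMA_SET_ROWS"))
    # To actually start the server (assumes llama-server is on PATH):
    # import subprocess; subprocess.run(cmd, env=env, check=True)
```

Returning the command and environment separately keeps the on/off comparison honest: the two runs differ only in that one variable.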
Test 1: Single-User Performance
First, I wanted to know if this change makes any difference for the most common use case: a single person interacting with a model. The theory suggests there should be little to no benefit here, since there are no “other” sequences to worry about.
I used llama-server with the Open WebUI front-end to perform text summarization tasks. My models of choice were two popular options that fit nicely on a 24GB card: the Qwen3-30B-A3B-Instruct-2507 MoE (Q4_K_XL) and the older Qwen2-32B (Q4_K_XL). I tested with and without the high-throughput mode enabled across different context sizes.
Here are the averaged results from my tests.
| Model | Context Size | High-Throughput Mode | Prompt Processing (t/s) | Token Generation (t/s) | % Change, On vs. Off (PP / TG) |
| --- | --- | --- | --- | --- | --- |
| Qwen3 30B Instruct | 4K | Off | 1680.20 | 90.28 | -0.12% / +9.8% |
| Qwen3 30B Instruct | 4K | On | 1678.19 | 99.12 | |
| Qwen3 30B Instruct | 16K | Off | 1169.72 | 53.46 | +0.04% / +6.4% |
| Qwen3 30B Instruct | 16K | On | 1170.17 | 56.90 | |
| Qwen2 32B | 4K | Off | 880.09 | 27.71 | +0.54% / +1.8% |
| Qwen2 32B | 4K | On | 884.82 | 28.21 | |
| Qwen2 32B | 8K | Off | 753.27 | 24.30 | +0.39% / +2.3% |
| Qwen2 32B | 8K | On | 756.23 | 24.86 | |
As you can see, the impact on prompt processing is practically zero (within about half a percent either way), which aligns with the theory. Token generation shows small gains: roughly 6% to 10% on the 30B MoE model, peaking at 4K context, and only about 2% on the 32B model. For a single user, this update doesn’t change the game.
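If you want to check the percentage column yourself, it is just the relative change from the Off row to the On row. A small sketch, using two rows from the table above (the function name is my own):

```python
def pct_change(off: float, on: float) -> float:
    """Relative change from the 'Off' value to the 'On' value, in percent."""
    return (on - off) / off * 100.0


# (prompt processing t/s, token generation t/s) taken from the table above
single_user = {
    "Qwen3 30B Instruct, 4K": {"off": (1680.20, 90.28), "on": (1678.19, 99.12)},
    "Qwen2 32B, 8K": {"off": (753.27, 24.30), "on": (756.23, 24.86)},
}

if __name__ == "__main__":
    for name, run in single_user.items():
        pp = pct_change(run["off"][0], run["on"][0])
        tg = pct_change(run["off"][1], run["on"][1])
        print(f"{name}: PP {pp:+.2f}% / TG {tg:+.1f}%")
```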
Test 2: Multi-User & Batch Processing Performance
This is where the new high-throughput mode is supposed to shine. To simulate a multi-user or heavy batch processing workload, I used the llama-batched-bench tool included with llama.cpp. This tool fires off many prompts at once, which is the exact scenario the split KV cache is designed to optimize.
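A rough sketch of how an off/on comparison with llama-batched-bench could be scripted is below. The prompt length, generation length and parallel levels are illustrative assumptions, not the exact parameters behind my numbers, and the helper name is hypothetical. The context size is derived on the assumption that prompts are not shared between sequences, so the cache has to hold the largest batch times (prompt + generated tokens).

```python
import os


def batched_bench_runs(model_path: str, n_prompt: int = 2048, n_gen: int = 128,
                       parallel_levels=(1, 2, 4, 8)):
    """Return {'off': (cmd, env), 'on': (cmd, env)} for an A/B benchmark."""
    # Assumed sizing: room for the largest batch, each with its own prompt.
    ctx_size = max(parallel_levels) * (n_prompt + n_gen)
    cmd = [
        "llama-batched-bench",
        "-m", model_path,
        "-c", str(ctx_size),
        "-ngl", "99",
        "-npp", str(n_prompt),                        # prompt tokens per sequence
        "-ntg", str(n_gen),                           # generated tokens per sequence
        "-npl", ",".join(map(str, parallel_levels)),  # parallel sequence counts
    ]
    env_off = dict(os.environ)
    env_off.pop("LLAMA_SET_ROWS", None)   # "off" means the variable is unset
    env_on = dict(env_off)
    env_on["LLAMA_SET_ROWS"] = "1"
    return {"off": (list(cmd), env_off), "on": (list(cmd), env_on)}


if __name__ == "__main__":
    for mode, (cmd, env) in batched_bench_runs("model.gguf").items():
        print(mode, env.get("LLAMA_SET_ROWS"), " ".join(cmd))
        # To run for real (assumes the binary is on PATH):
        # import subprocess; subprocess.run(cmd, env=env, check=True)
```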
I ran these benchmarks on the Qwen2.5-3B-Coder and the Qwen2-32B model to see how the feature performs on both smaller and larger models.
| Model Name | Context Size (PP) | High-Throughput Mode | Prompt Processing (t/s) | Token Generation (t/s) | % Change, On vs. Off (PP / TG) |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5 3B Coder | 2048 | Off | 4144.11 | 1252.88 | +175.1% / +36.6% |
| Qwen2.5 3B Coder | 2048 | On | 11402.03 | 1711.64 | |
| Qwen2.5 3B Coder | 4096 | Off | 2334.23 | 787.27 | +336.6% / +70.9% |
| Qwen2.5 3B Coder | 4096 | On | 10191.31 | 1345.14 | |
| Qwen2 32B | 1024 | Off | 1238.48 | 91.62 | +6.9% / +3.5% |
| Qwen2 32B | 1024 | On | 1323.94 | 94.84 | |
| Qwen2 32B | 2048 | Off | 1142.61 | 88.09 | +13.3% / +3.8% |
| Qwen2 32B | 2048 | On | 1294.79 | 91.47 | |
The results here are dramatically different. On the smaller 3B model, prompt processing speed increased by a staggering 175% at a 2048-token context and 337% at 4096 tokens, which works out to roughly 2.8x and 4.4x the baseline. Token generation also saw a healthy boost of 37% to 71%.
With the much larger 32B model, the improvements are more modest but still present, with prompt processing gaining 7% to 13% and token generation a smaller bump of about 4%. The benefit is clearly more pronounced on smaller models, where the overhead of cross-sequence attention was comparatively larger.
The Verdict
The new high-throughput mode in llama.cpp is a targeted and highly effective optimization. It’s not a magic bullet that will speed up every interaction, but it delivers on its promise for specific, parallel workloads.
If you are a solo enthusiast running interactive chat sessions, you won’t notice much of a change. However, if you are serving a model to multiple users, running batch inference jobs, or using parallel processing tools, enabling this feature is a clear win. It provides a significant performance uplift without any hardware changes, embodying the spirit of getting the most performance-per-dollar out of our systems.