I Tested llama.cpp’s New Speed Boost Mode with an RTX 3090 – Here’s What I Found

Allan Witt • Jul 31, 2025 at 1:15pm PDT

The development pace in the local LLM scene is relentless, and the team behind llama.cpp has rolled out another interesting update: a new high-throughput mode. The key claim is that by changing how the KV cache is handled for multiple, parallel requests, we can see significant performance gains. As a hands-on enthusiast, I wanted to cut through the hype and see what this means for a common, value-oriented hardware setup like my own.

The core idea behind this update is a shift from a “unified” to a “split” KV cache. Previously, when running multiple requests (or “sequences”) at once, the GPU spent a lot of work computing attention between independent conversations that have nothing to do with each other, a process called “cross-sequence attention.” The new mode gives each sequence its own dedicated KV cache, eliminating that wasted effort. This is supposed to raise throughput and reduce memory use, especially in multi-user scenarios.
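To make the unified-versus-split distinction concrete, here is a toy cost model in Python. It is only a sketch under simplifying assumptions, not a description of how llama.cpp's kernels are actually written: it counts one attention score per (new token, cached position) pair during a single decode step, and the function names are invented for illustration.

```python
def scores_unified(seq_lens):
    """Decode step with one shared cache (toy model): each sequence's new token
    is scored against every cached position, including positions that belong
    to other sequences and are masked out afterwards."""
    total_cached = sum(seq_lens)
    return len(seq_lens) * total_cached


def scores_split(seq_lens):
    """Decode step with per-sequence caches (toy model): each new token is
    scored only against its own sequence's cached positions."""
    return sum(seq_lens)


if __name__ == "__main__":
    # Compare the two counts for 1, 4 and 16 parallel sequences of 2048 tokens.
    for n_seq in (1, 4, 16):
        lens = [2048] * n_seq
        print(n_seq, scores_unified(lens), scores_split(lens))
```

In this toy model the two counts are identical for a single sequence, and with N equal-length sequences the unified count is N times larger. Real speedups depend on far more than this count, so treat it as intuition rather than a prediction.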

To see if this holds up, I ran a series of tests on my rig: an AMD EPYC 9334 CPU with 48 GB of RAM and a trusty 24GB NVIDIA RTX 3090, running on Ubuntu. To enable the new feature, you currently need to set the LLAMA_SET_ROWS=1 environment variable, though this is expected to become the default in the future.
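If you launch the server from a script, the environment variable can be set per process rather than globally. The following is a minimal Python sketch, assuming `llama-server` is on your `PATH`; the model path, context size, and `-ngl 99` (offload all layers to the GPU) are placeholder choices of mine, not the exact settings used in these tests.

```python
import os
import subprocess


def high_throughput_env(enabled):
    """Return a copy of the current environment with LLAMA_SET_ROWS set or cleared."""
    env = dict(os.environ)
    env.pop("LLAMA_SET_ROWS", None)  # start from a known state
    if enabled:
        env["LLAMA_SET_ROWS"] = "1"
    return env


def server_command(model_path, ctx_size=4096):
    """Build a llama-server argument list (placeholder values, adjust to taste)."""
    return ["llama-server", "-m", model_path, "-c", str(ctx_size), "-ngl", "99"]


def launch_server(model_path, enabled=True, ctx_size=4096):
    """Start llama-server with the chosen mode; blocks until the server exits."""
    return subprocess.run(
        server_command(model_path, ctx_size),
        env=high_throughput_env(enabled),
        check=True,
    )
```

Calling `launch_server("model.gguf", enabled=False)` gives the baseline run, which makes it easy to flip between the two modes without touching your shell profile.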

Test 1: Single-User Performance

First, I wanted to know if this change makes any difference for the most common use case: a single person interacting with a model. The theory suggests there should be little to no benefit here, since there are no “other” sequences to worry about.

I used llama-server with the Open WebUI front-end to perform text summarization tasks. My models of choice were two popular options that fit nicely on a 24GB card: the Qwen3-30B-A3B-Instruct-2507 MoE (Q4_K_XL) and the older Qwen2-32B (Q4_K_XL). I tested with and without the high-throughput mode enabled across different context sizes.

Here are the averaged results from my tests.

Model	Context Size	High-Throughput Mode	Prompt Processing (t/s)	Token Generation (t/s)	% Increase (PP/TG)
Qwen3 30B Instruct	4K	Off	1680.2	90.28	baseline
Qwen3 30B Instruct	4K	On	1678.19	99.12	-0.12% / +9.8%
Qwen3 30B Instruct	16K	Off	1169.72	53.46	baseline
Qwen3 30B Instruct	16K	On	1170.17	56.90	+0.04% / +6.4%
Qwen2 32B	4K	Off	880.09	27.71	baseline
Qwen2 32B	4K	On	884.82	28.21	+0.54% / +1.8%
Qwen2 32B	8K	Off	753.27	24.30	baseline
Qwen2 32B	8K	On	756.23	24.86	+0.39% / +2.3%

As you can see, the impact on prompt processing is practically zero, which aligns with the theory. For token generation, there are some minor gains, topping out around 10% on the 30B model with a 4K context, but in most cases, the difference is negligible. For a single user, this update doesn’t change the game.
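For anyone reproducing the numbers, the % Increase column is simply the relative change from the Off measurement to the On measurement. A small Python helper (the function name is my own) shows the arithmetic using two values from the table above:

```python
def pct_change(off, on):
    """Relative change from the Off measurement to the On measurement, in percent."""
    return (on - off) / off * 100.0


# Token generation, Qwen3 30B Instruct at 4K context
print(f"{pct_change(90.28, 99.12):+.1f}%")     # +9.8%

# Prompt processing for the same configuration
print(f"{pct_change(1680.2, 1678.19):+.2f}%")  # -0.12%
```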

Test 2: Multi-User & Batch Processing Performance

This is where the new high-throughput mode is supposed to shine. To simulate a multi-user or heavy batch processing workload, I used the llama-batched-bench tool included with llama.cpp. This tool fires off many prompts at once, which is the exact scenario the split KV cache is designed to optimize.
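A rough way to script an A/B comparison with this tool is sketched below in Python. It assumes `llama-batched-bench` is on your `PATH`; the flags `-npp` (prompt tokens), `-ntg` (generated tokens), and `-npl` (parallel sequence counts) are the tool's own, but every value passed in here is a placeholder rather than the configuration behind my results, and the helper names are hypothetical.

```python
import os
import subprocess


def batched_bench_command(model_path, n_prompt, n_gen, parallel_levels, ctx_size):
    """Build an argument list for llama-batched-bench (all values are placeholders)."""
    return [
        "llama-batched-bench",
        "-m", model_path,
        "-c", str(ctx_size),
        "-ngl", "99",
        "-npp", str(n_prompt),
        "-ntg", str(n_gen),
        "-npl", ",".join(str(n) for n in parallel_levels),
    ]


def run_off_and_on(cmd):
    """Run the same benchmark twice: without and with LLAMA_SET_ROWS=1."""
    outputs = {}
    for label, value in (("off", None), ("on", "1")):
        env = dict(os.environ)
        env.pop("LLAMA_SET_ROWS", None)  # make sure the baseline run is clean
        if value is not None:
            env["LLAMA_SET_ROWS"] = value
        result = subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)
        outputs[label] = result.stdout
    return outputs
```

The context size has to be large enough to hold all parallel sequences at once, so pick it with the highest `-npl` value in mind.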

I ran these benchmarks on the Qwen2.5-3B-Coder and the Qwen2-32B model to see how the feature performs on both smaller and larger models.

Model Name	Context Size (PP)	High-Throughput Mode	Prompt Processing (t/s)	Token Generation (t/s)	% Increase (PP/TG)
Qwen2.5 3B Coder	2048	Off	4144.11	1252.88	baseline
Qwen2.5 3B Coder	2048	On	11402.03	1711.64	+175.1% / +36.6%
Qwen2.5 3B Coder	4096	Off	2334.23	787.27	baseline
Qwen2.5 3B Coder	4096	On	10191.31	1345.14	+336.6% / +70.9%
Qwen2 32B	1024	Off	1238.48	91.62	baseline
Qwen2 32B	1024	On	1323.94	94.84	+6.9% / +3.5%
Qwen2 32B	2048	Off	1142.61	88.09	baseline
Qwen2 32B	2048	On	1294.79	91.47	+13.3% / +3.8%

The results here are dramatically different. On the smaller 3B model, the prompt processing speed increased by a staggering 175% to over 336% depending on the context length. Token generation also saw a healthy boost of up to 71%.

With the much larger 32B model, the improvements are more modest but still present, with prompt processing gaining up to 13% and token generation seeing a smaller 4% bump. The benefit is clearly more pronounced on smaller models where the overhead of cross-sequence attention was comparatively larger.

The Verdict

The new high-throughput mode in llama.cpp is a targeted and highly effective optimization. It’s not a magic bullet that will speed up every interaction, but it delivers on its promise for specific, parallel workloads.

For the solo enthusiast running interactive chat sessions, you won’t notice much of a change. However, if you are serving a model to multiple users, running batch inference jobs, or using parallel processing tools, enabling this feature is a clear win. It provides a significant performance uplift without any hardware changes, embodying the spirit of getting the most performance-per-dollar out of our systems.

URL: https://www.hardware-corner.net/llama-cpp-high-throughput-mode-20250731/