URL: https://www.hardware-corner.net/llama-cpp-high-throughput-mode-20250731/

I Tested llama.cpp’s New Speed Boost Mode with an RTX 3090 – Here’s What I Found | Hardware Corner


Allan Witt Jul 31, 2025 at 1:15pm PDT

The development pace in the local LLM scene is relentless, and the team behind llama.cpp has rolled out another interesting update: a new high-throughput mode. The key claim is that by changing how the KV cache is handled for multiple, parallel requests, we can see significant performance gains. As a hands-on enthusiast, I wanted to cut through the hype and see what this means for a common, value-oriented hardware setup like my own.

The core idea behind this update is a shift from a “unified” to a “split” KV cache. Previously, when running multiple requests (or “sequences”) at once, all sequences shared one cache, so the GPU spent a lot of time computing attention between tokens of independent conversations, so-called “cross-sequence attention,” only for those results to be thrown away. The new mode gives each sequence its own dedicated KV cache, eliminating that wasted effort. This is supposed to deliver faster speeds and lower memory use, especially in multi-user scenarios.
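To get a feel for why this matters more as the number of sequences grows, here is a toy cost model in Python. It is a deliberately simplified sketch, not llama.cpp's actual implementation: it assumes every sequence adds one new token per decode step, that all sequences are the same length, and that cost is proportional to the number of query-key scores computed. The function names are my own.

```python
def attention_scores_per_step(n_seq: int, tokens_per_seq: int, unified: bool) -> int:
    """Count query-key score entries computed in one decode step (toy model)."""
    if unified:
        # Shared cache: every query is scored against every cached token,
        # including tokens that belong to other sequences.
        keys_per_query = n_seq * tokens_per_seq
    else:
        # Split cache: every query only sees its own sequence's tokens.
        keys_per_query = tokens_per_seq
    # One new query token per sequence in this step.
    return n_seq * keys_per_query


def wasted_fraction(n_seq: int, tokens_per_seq: int) -> float:
    """Share of the unified-cache work that the split cache avoids."""
    unified = attention_scores_per_step(n_seq, tokens_per_seq, unified=True)
    split = attention_scores_per_step(n_seq, tokens_per_seq, unified=False)
    return 1 - split / unified


for n in (1, 4, 16):
    print(f"{n} sequence(s): {wasted_fraction(n, 2048):.1%} of scores avoided")
```

In this model a single sequence has nothing to avoid, while the avoidable share climbs quickly as more sequences run side by side. Real-world gains depend on much more than attention, so treat this as intuition only.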

To see if this holds up, I ran a series of tests on my rig: an AMD EPYC 9334 CPU with 48 GB of RAM and a trusty 24GB NVIDIA RTX 3090, running on Ubuntu. To enable the new feature, you currently need to set the LLAMA_SET_ROWS=1 environment variable, though this is expected to become the default in the future.
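If you launch llama.cpp from a script, the toggle can be handled in the environment you pass to the process. The sketch below is one way to do that in Python. It assumes `llama-server` is on your PATH and that leaving `LLAMA_SET_ROWS` unset keeps the old behavior; the helper names, the model path, and the choice of four parallel slots (`-np`) are placeholders of mine, not the settings used in these tests.

```python
import os
import subprocess


def high_throughput_env(enabled=True, base_env=None):
    """Copy an environment and switch LLAMA_SET_ROWS on or off."""
    env = dict(os.environ if base_env is None else base_env)
    if enabled:
        env["LLAMA_SET_ROWS"] = "1"
    else:
        # Assumption: an unset variable means the feature stays off.
        env.pop("LLAMA_SET_ROWS", None)
    return env


def start_server(model_path, parallel_slots=4, enabled=True):
    """Start llama-server with the chosen setting and return the process."""
    # -m selects the GGUF model file, -np the number of parallel slots.
    cmd = ["llama-server", "-m", model_path, "-np", str(parallel_slots)]
    return subprocess.Popen(cmd, env=high_throughput_env(enabled))
```

Calling `start_server("models/your-model.gguf")` would start the server with the feature on; passing `enabled=False` gives the baseline for an A/B comparison.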

Test 1: Single-User Performance

First, I wanted to know if this change makes any difference for the most common use case: a single person interacting with a model. The theory suggests there should be little to no benefit here, since there are no “other” sequences to worry about.

I used llama-server with the Open WebUI front-end to perform text summarization tasks. My models of choice were two popular options that fit nicely on a 24GB card: the Qwen3-30B-A3B-Instruct-2507 MoE (Q4_K_XL) and the older Qwen2-32B (Q4_K_XL). I tested with and without the high-throughput mode enabled across different context sizes.

Here are the averaged results from my tests.

| Model | Context Size | High-Throughput Mode | Prompt Processing (t/s) | Token Generation (t/s) | % Change vs. Off (PP / TG) |
|---|---|---|---|---|---|
| Qwen3 30B Instruct | 4K | Off | 1680.2 | 90.28 | |
| Qwen3 30B Instruct | 4K | On | 1678.19 | 99.12 | -0.12% / +9.8% |
| Qwen3 30B Instruct | 16K | Off | 1169.72 | 53.46 | |
| Qwen3 30B Instruct | 16K | On | 1170.17 | 56.90 | +0.04% / +6.4% |
| Qwen2 32B | 4K | Off | 880.09 | 27.71 | |
| Qwen2 32B | 4K | On | 884.82 | 28.21 | +0.54% / +1.8% |
| Qwen2 32B | 8K | Off | 753.27 | 24.30 | |
| Qwen2 32B | 8K | On | 756.23 | 24.86 | +0.39% / +2.3% |

As you can see, the impact on prompt processing is practically zero, which aligns with the theory. For token generation, there are some minor gains: roughly 6% to 10% on the 30B MoE model, topping out at the 4K context, and only about 2% on the 32B model. For a single user, this update doesn’t change the game.
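For anyone who wants to check the percentage column or apply the same comparison to their own runs, the arithmetic is just the relative change from the Off value to the On value. A minimal Python sketch, using the Qwen3 30B 4K numbers from the table above (the function name is my own):

```python
def pct_change(off: float, on: float) -> float:
    """Percentage change going from the 'Off' value to the 'On' value."""
    return (on - off) / off * 100.0


# Qwen3 30B Instruct, 4K context, values copied from the table above.
pp = pct_change(1680.2, 1678.19)   # prompt processing
tg = pct_change(90.28, 99.12)      # token generation
print(f"PP {pp:+.2f}% / TG {tg:+.1f}%")  # PP -0.12% / TG +9.8%
```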

Test 2: Multi-User & Batch Processing Performance

This is where the new high-throughput mode is supposed to shine. To simulate a multi-user or heavy batch processing workload, I used the llama-batched-bench tool included with llama.cpp. This tool fires off many prompts at once, which is the exact scenario the split KV cache is designed to optimize.
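As a rough guide to setting up a similar run, the Python sketch below assembles a `llama-batched-bench` command line. The exact settings behind the results in this article are not listed, so the prompt size, generation length, and parallel levels here are placeholders, as is the model path. The context-size helper reflects my understanding that, without a shared prompt, each parallel sequence needs room for its own prompt plus generated tokens; treat that as an assumption and check it against the tool's documentation.

```python
def required_kv_cells(n_prompt: int, n_gen: int, n_parallel: int) -> int:
    """KV cells needed when every sequence keeps its own prompt and output."""
    return n_parallel * (n_prompt + n_gen)


def _csv(values) -> str:
    """Join numbers into the comma-separated form the tool accepts."""
    return ",".join(str(v) for v in values)


def batched_bench_command(model_path, prompt_sizes, gen_sizes, parallel_levels):
    """Build an argument list sized for the largest requested combination."""
    ctx = required_kv_cells(max(prompt_sizes), max(gen_sizes), max(parallel_levels))
    return [
        "llama-batched-bench",
        "-m", model_path,                 # GGUF model file (placeholder path)
        "-c", str(ctx),                   # total context across all sequences
        "-npp", _csv(prompt_sizes),       # prompt tokens per sequence
        "-ntg", _csv(gen_sizes),          # generated tokens per sequence
        "-npl", _csv(parallel_levels),    # number of parallel sequences
    ]


cmd = batched_bench_command("models/your-model.gguf", [2048], [128], [1, 8])
print(" ".join(cmd))
```

Running the printed command once with `LLAMA_SET_ROWS=1` in the environment and once without it gives the On and Off figures for a comparison.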

I ran these benchmarks on the Qwen2.5-3B-Coder and the Qwen2-32B model to see how the feature performs on both smaller and larger models.

| Model Name | Context Size (PP) | High-Throughput Mode | Prompt Processing (t/s) | Token Generation (t/s) | % Change vs. Off (PP / TG) |
|---|---|---|---|---|---|
| Qwen2.5 3B Coder | 2048 | Off | 4144.11 | 1252.88 | |
| Qwen2.5 3B Coder | 2048 | On | 11402.03 | 1711.64 | +175.1% / +36.6% |
| Qwen2.5 3B Coder | 4096 | Off | 2334.23 | 787.27 | |
| Qwen2.5 3B Coder | 4096 | On | 10191.31 | 1345.14 | +336.6% / +70.9% |
| Qwen2 32B | 1024 | Off | 1238.48 | 91.62 | |
| Qwen2 32B | 1024 | On | 1323.94 | 94.84 | +6.9% / +3.5% |
| Qwen2 32B | 2048 | Off | 1142.61 | 88.09 | |
| Qwen2 32B | 2048 | On | 1294.79 | 91.47 | +13.3% / +3.8% |

The results here are dramatically different. On the smaller 3B model, prompt processing speed increased by a staggering 175% at a 2048-token prompt and by more than 336% at 4096 tokens. Token generation also saw a healthy boost of 37% to 71%.

With the much larger 32B model, the improvements are more modest but still present, with prompt processing gaining up to 13% and token generation seeing a smaller 4% bump. The benefit is clearly more pronounced on smaller models where the overhead of cross-sequence attention was comparatively larger.

The Verdict

The new high-throughput mode in llama.cpp is a targeted and highly effective optimization. It’s not a magic bullet that will speed up every interaction, but it delivers on its promise for specific, parallel workloads.

For the solo enthusiast running interactive chat sessions, you won’t notice much of a change. However, if you are serving a model to multiple users, running batch inference jobs, or using parallel processing tools, enabling this feature is a clear win. It provides a significant performance uplift without any hardware changes, embodying the spirit of getting the most performance-per-dollar out of our systems.
