RTX 5090 LLM Benchmark Results: 10K Tokens/sec Prompt Processing, 147K Context
By Allan Witt | Updated: November 6, 2025
I recently completed extensive local LLM inference benchmarks on the NVIDIA RTX 5090 32 GB. My primary focus was gathering raw performance data on critical metrics for the local enthusiast: prompt processing speed (PP), token generation throughput (TG), and the maximum context window I could sustain using 4-bit quantization (Q4_K_XL). My goal here is to provide the data necessary to judge the performance-per-dollar of this card, especially for those of you aiming for huge context depths or trying to avoid complex multi-GPU headaches.
My Testing Environment
I configured my test system specifically to eliminate any potential bottlenecks outside of the GPU itself, ensuring the 5090's immense memory bandwidth was fully utilized.
| Component | Specification |
|---|---|
| GPU | NVIDIA GeForce RTX 5090 |
| CPU | AMD EPYC 7642 48-Core Processor |
| System Memory | 128 GB DDR4 RAM |
| Operating System | Ubuntu 24.04.2 LTS |
| Software Stack | PyTorch 2.8.3 (CUDA 12.8) |
| Benchmark Tool | llama.cpp (llama-bench build 466c1911) |
Quantization and Context Depth
I used llama-bench to test five models, four Qwen3 variants plus gpt-oss 20B, all using the Q4_K_XL GGUF quantization format, with the Qwen3 models ranging from 8 billion up to 32 billion parameters. The tests used Flash Attention (-fa 1) to optimize speed, and I specifically measured performance against prefilled context depths (-d) to see how the card handles long-running sessions.
Prompt Processing (PP) measures memory bandwidth efficiency. I prefilled the context (e.g., 4098 tokens) and then measured the speed of processing an additional 1024 token prompt (-p 1024).
Token Generation (TG) measures sustained inference speed. I measured the speed of generating 128 new tokens (-n 128) after the prompt was processed.
The command lines I used for data collection were adjusted per model to match their respective sizes and context capabilities. For example, the command line for Qwen3 8B was:
./llama-bench -m /home/allanw/llama.cpp/Qwen3-8B-128K-UD-Q4_K_XL.gguf -fa 1 -d 4098,8196,16384,32768,45062,57356,65536,86026,131072 -p 1024 -n 128 -ngl 99
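Since the command was adjusted per model, a small helper can keep the flags consistent between runs. The following is only a minimal sketch: the function name `build_llama_bench_cmd` and its parameter names are my own illustration, and only the flags themselves (-m, -fa, -d, -p, -n, -ngl) come from the command above.

```python
from typing import List, Sequence


def build_llama_bench_cmd(
    model_path: str,
    depths: Sequence[int],
    n_prompt: int = 1024,
    n_gen: int = 128,
    n_gpu_layers: int = 99,
    flash_attn: bool = True,
    binary: str = "./llama-bench",
) -> List[str]:
    """Return an argv list that mirrors the llama-bench flags shown above."""
    return [
        binary,
        "-m", model_path,                         # GGUF model file
        "-fa", "1" if flash_attn else "0",        # Flash Attention on/off
        "-d", ",".join(str(d) for d in depths),   # prefilled context depths
        "-p", str(n_prompt),                      # prompt tokens to process
        "-n", str(n_gen),                         # tokens to generate
        "-ngl", str(n_gpu_layers),                # layers offloaded to the GPU
    ]


cmd = build_llama_bench_cmd(
    "/home/allanw/llama.cpp/Qwen3-8B-128K-UD-Q4_K_XL.gguf",
    depths=[4098, 8196, 16384, 32768, 45062, 57356, 65536, 86026, 131072],
)
print(" ".join(cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would launch the benchmark; for larger models, the `depths` list would simply be shortened to what fits in VRAM.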
Benchmark Results: Prompt Processing Speed
Prompt processing speed is critical for Retrieval-Augmented Generation (RAG) and handling long conversational histories. This is where the 5090's memory bandwidth truly shines. The table below shows the tokens per second (t/s) when processing a new 2048 token prompt against existing context.
| Model | Ctx 4K (t/s) | Ctx 8K (t/s) | Ctx 32K (t/s) | Ctx 65K (t/s) |
|---|---|---|---|---|
| Qwen3 8B | 10406.47 | 8745.32 | 3687.76 | 2211.55 |
| Qwen3 14B | 6497.55 | 5593.80 | 2908.42 | 1707.24 |
| Qwen3moe 30B.A3B | 6630.42 | 5799.28 | 2877.53 | 1512.46 |
| Qwen3 32B | 2931.32 | 2530.12 | 1451.08 | N/A |
| gpt-oss 20B | 9443.81 | 8379.13 | 5183.08 | 3019.74 |
The performance here is outstanding. I saw the Qwen3 8B model hit over 10,400 tokens/second on prefill, demonstrating the massive throughput capability of the 5090. Even with the dense 32B model loaded (18.64 GiB VRAM footprint), I still measured nearly 3,000 t/s at the 4K context mark. Critically, the dense 32B model still processed prompts at over 1,450 t/s even with the context window prefilled to 32,768 tokens.
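To make the slowdown with context depth easier to compare across models, the table above can be reduced to a "retention" ratio: throughput at a given depth divided by throughput at 4K. This is just a sketch over the numbers already in the table; the `retention` helper and the dictionary layout are my own illustration.

```python
# Prompt-processing throughput (t/s), copied from the table above.
pp_tps = {
    "Qwen3 8B":         {"4K": 10406.47, "8K": 8745.32, "32K": 3687.76, "65K": 2211.55},
    "Qwen3 14B":        {"4K": 6497.55,  "8K": 5593.80, "32K": 2908.42, "65K": 1707.24},
    "Qwen3moe 30B.A3B": {"4K": 6630.42,  "8K": 5799.28, "32K": 2877.53, "65K": 1512.46},
    "Qwen3 32B":        {"4K": 2931.32,  "8K": 2530.12, "32K": 1451.08},  # no 65K result
    "gpt-oss 20B":      {"4K": 9443.81,  "8K": 8379.13, "32K": 5183.08, "65K": 3019.74},
}


def retention(model, depth, baseline="4K"):
    """Fraction of the baseline throughput still available at `depth`."""
    row = pp_tps[model]
    if depth not in row:
        return None  # that depth was not measured for this model
    return row[depth] / row[baseline]


for model in pp_tps:
    cells = []
    for depth in ("8K", "32K", "65K"):
        r = retention(model, depth)
        cells.append(f"{depth}: " + ("n/a" if r is None else f"{r:.0%}"))
    print(f"{model:<18}" + "  ".join(cells))
```

Read this way, the smallest model has the highest absolute speed but loses the largest share of it by 32K, while gpt-oss 20B holds on to the largest share of its 4K throughput.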
Benchmark Results: Token Generation Throughput
Token generation is what determines real-time chat responsiveness. This table shows the sustained inference speed measured in tokens per second (t/s) while generating 128 tokens.
| Model | Size (GiB) | Ctx 4K (t/s) | Ctx 8K (t/s) | Ctx 32K (t/s) |
|---|---|---|---|---|
| Qwen3 8B | 4.78 | 185.91 | 169.78 | 111.91 |
| Qwen3 14B | 8.53 | 123.79 | 115.45 | 82.35 |
| Qwen3moe 30B.A3B | 16.47 | 234.30 | 170.49 | 110.65 |
| Qwen3 32B | 18.64 | 61.38 | 55.53 | 43.82 |
Large Context Testing: 147K Tokens
One of the most impressive aspects of the RTX 5090's 32 GB VRAM is its ability to sustain massive context windows without falling back to system RAM or disk offloading, both of which severely impact performance. I was able to push the Qwen3moe 30B model up to 147,000 tokens out of its 262K maximum context, entirely within VRAM.
| Model | VRAM Used (GB) | Context | PP 2048 (t/s) | TG 128 (t/s) |
|---|---|---|---|---|
| Qwen3 8B | 23 | 131K | 948.40 | 49.44 |
| Qwen3 14B | 31 | 131K | 908.39 | 37.20 |
| Qwen3moe 30B.A3B | 31 | 147K | 666.24 | 52.28 |
| Qwen3 32B | 31 | 45K | 666.24 | 52.28 |
| gpt-oss 20B | 15 | 131K | 1636.40 | 112.01 |
At 147K tokens, VRAM usage peaked at 31 GB, fully contained on a single RTX 5090. Even at this extreme context length, inference remained stable at ~52 tokens per second.
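For a sense of where the VRAM goes at these depths, here is a back-of-the-envelope KV cache estimate for the Qwen3 8B row. It is a sketch under stated assumptions, not a measurement: the architecture values (36 layers, 8 KV heads, head dimension 128) and the 16-bit cache type are my assumptions and are not given in this article; only the 4.78 GiB weight size and the 131,072-token depth come from my results.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to hold keys and values for n_tokens across all layers."""
    # Leading factor 2: one tensor for keys, one for values.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


# ASSUMED Qwen3 8B architecture values (not stated in this article).
kv = kv_cache_bytes(131072, n_layers=36, n_kv_heads=8, head_dim=128)

weights_gib = 4.78  # model size from the token generation table
total_gib = kv / GIB + weights_gib

print(f"KV cache: {kv / GIB:.2f} GiB, weights + KV cache: {total_gib:.2f} GiB")
# prints "KV cache: 18.00 GiB, weights + KV cache: 22.78 GiB"
```

Under those assumptions the estimate lands in the same ballpark as the 23 GB I observed for Qwen3 8B at 131K. Compute buffers and the GB versus GiB distinction are ignored here, so treat it as a rough cross-check only.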
I was able to push three out of five test models (Qwen3 8B, Qwen3 14B, and gpt-oss 20B) to their maximum context limits. The Qwen3 32B dense variant managed a still-impressive 45K tokens, filling nearly all available VRAM.
The standout performer, however, was gpt-oss 20B: it handled 131K tokens at a blistering 1,600 tokens per second prompt processing speed, completing the full context load in roughly 80 seconds. Inference throughput was equally impressive at 112 tokens per second, exceptionally fast for this scale.
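The roughly 80 second figure follows from dividing the context length by the prompt processing rate in the gpt-oss row of the large-context table. A minimal sketch of that arithmetic, with `prefill_seconds` being a helper name of my own:

```python
def prefill_seconds(n_tokens, tokens_per_second):
    """Time to process n_tokens at a constant prompt-processing rate."""
    return n_tokens / tokens_per_second


# 131,072 tokens at the 1636.40 t/s measured at that depth.
t = prefill_seconds(131072, 1636.40)
print(f"{t:.1f} s")
```

Because that rate was measured with the context already filled, and shallower depths process faster, filling an empty context to 131K would likely take somewhat less time than this estimate.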
MoE Performance
For hardware enthusiasts focused on performance-per-watt, the comparison between the dense 32B model and the sparse 30B MoE model is key.
Despite having similar total parameter counts, the 30B MoE model is significantly faster in token generation, achieving 234 t/s at short context lengths. This speed even surpasses the dense 8B model's 186 t/s throughput. This confirms that the MoE architecture, which only activates a fraction of weights per token, pairs exceptionally well with the 5090's high memory bandwidth, allowing rapid swapping of the necessary weights.
Running the full dense Qwen3 32B model resulted in 61 t/s at 4k context. For a model of this complexity and size running on a single consumer card, this is a fantastic result, proving that the 32 GB VRAM is sufficient to handle large, single-instance models without resorting to a multi-GPU configuration.
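One way to put these generation numbers in context is a crude bandwidth ceiling: if every generated token has to read all active weights from VRAM once, throughput cannot exceed bandwidth divided by bytes read per token. The sketch below rests on assumptions I did not measure: the 1,792 GB/s figure is the spec-sheet bandwidth for the RTX 5090, and the MoE active fraction of 3/30 is only inferred from the "A3B" name. It also ignores KV cache reads and the always-active attention and embedding weights.

```python
GIB = 1024 ** 3
BANDWIDTH_BYTES_PER_S = 1792e9  # ASSUMPTION: spec-sheet memory bandwidth, not measured


def decode_ceiling_tps(bytes_read_per_token, bandwidth=BANDWIDTH_BYTES_PER_S):
    """Upper bound on tokens/s if each token reads its weights from VRAM once."""
    return bandwidth / bytes_read_per_token


# Dense Qwen3 32B: the whole 18.64 GiB file is read for every token.
dense_ceiling = decode_ceiling_tps(18.64 * GIB)
dense_measured = 61.38  # t/s at 4K context, from the token generation table

# Qwen3moe 30B.A3B: ASSUME about 3B of 30B parameters are active per token.
moe_active_fraction = 3 / 30
moe_ceiling = decode_ceiling_tps(16.47 * GIB * moe_active_fraction)
moe_measured = 234.30  # t/s at 4K context, from the token generation table

print(f"dense 32B: ceiling {dense_ceiling:.1f} t/s, measured {dense_measured} t/s")
print(f"MoE 30B:   ceiling {moe_ceiling:.1f} t/s, measured {moe_measured} t/s")
```

Under these assumptions the dense 32B result sits at roughly two thirds of its ceiling, while the MoE model is well below its much higher ceiling, which suggests costs other than raw weight reads dominate there.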
RTX 5090 Pricing
While local LLM enthusiasts prioritize performance-per-dollar, the initial cost of entry for the 32 GB RTX 5090 remains substantial, placing it well above the hypothetical $2000 MSRP range. Current pricing across various AIB partners starts near $2500 and extends upwards, reflecting high demand for this level of VRAM and memory bandwidth. It appears this pricing tier is stable and unlikely to decrease significantly anytime soon, making it a premium investment; however, since hardware pricing is dynamic, it is always wise to check real-time costs before making a purchasing decision.
| Model | VRAM (GB) | Card Length (mm) | Current Price | Check Real-Time Price |
|---|---|---|---|---|
| MSI GAMING TRIO OC | 32 | 359 | $2499.99 | Check Current Price |
| PNY EPIC-X RGB OC | 32 | 329 | $2499.99 | Check Current Price |
| Gigabyte AORUS MASTER | 32 | 360 | $2599.99 | Check Current Price |
| Zotac GAMING SOLID OC | 32 | 330 | $2695.95 | Check Current Price |
*Affiliate Disclaimer: Some links provided above are affiliate links. If you make a purchase through these links, I may earn a small commission, which helps support the continuation of this hardware testing.
Summary
The RTX 5090 32 GB is a powerful tool for the local LLM user looking to simplify their setup while targeting formerly inaccessible performance tiers. Its high VRAM capacity and exceptional memory bandwidth allow it to efficiently run large, quantized models (up to 32B Q4_K) and sustain extreme context windows well over 100K tokens. If your use case demands single-card simplicity coupled with maximum context depth, the 5090 offers a compelling value proposition in raw LLM inference power.
