I recently completed extensive local LLM inference benchmarks on the NVIDIA RTX 5090 32 GB. My primary focus was gathering raw performance data on critical metrics for the local enthusiast: prompt processing speed (PP), token generation throughput (TG), and the maximum context window I could sustain using 4-bit quantization (Q4_K_XL). My goal here is to provide the data necessary to judge the performance-per-dollar of this card, especially for those of you aiming for huge context depths or trying to avoid complex multi-GPU headaches.

My Testing Environment

I configured my test system specifically to eliminate potential bottlenecks outside the GPU itself, so that the results reflect the 5090 rather than the rest of the platform.

Component	Specification
GPU	NVIDIA GeForce RTX 5090
CPU	AMD EPYC 7642 48-Core Processor
System Memory	129 GB DDR4 RAM
Operating System	Ubuntu 24.04.2 LTS
Software Stack	PyTorch 2.8.3 (CUDA 12.8)
Benchmark Tool	llama.cpp (llama-bench build 466c1911)

Quantization and Context Depth

I used llama-bench to test five different models (four Qwen3 variants and gpt-oss 20B), all using the Q4_K_XL GGUF quantization format and ranging from 8 billion to 32 billion parameters. The tests used Flash Attention (-fa 1) to optimize speed, and I specifically measured performance against prefilled context depths (-d) to see how the card handles long-running sessions.

Prompt Processing (PP) measures how quickly the card ingests input tokens, a workload that leans mostly on raw compute throughput. I prefilled the context (e.g., 4098 tokens) and then measured the speed of processing an additional 1024-token prompt (-p 1024).

Token Generation (TG) measures sustained inference speed. I measured the speed of generating 128 new tokens (-n 128) after the prompt was processed.

The command lines I used for data collection were adjusted per model to match each model's size and context capability. For example, this was the command line for Qwen3 8B:

./llama-bench -m /home/allanw/llama.cpp/Qwen3-8B-128K-UD-Q4_K_XL.gguf -fa 1 -d 4098,8196,16384,32768,45062,57356,65536,86026,131072 -p 1024 -n 128 -ngl 99
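Since the command changes per model mainly in the model path and the list of depths, it can help to generate it programmatically. The following is a minimal Python sketch, not the script used for these benchmarks: the helper name `build_llama_bench_cmd` and the shortened model path are hypothetical, and it only uses the flags that appear in the command above.

```python
from typing import List, Sequence


def build_llama_bench_cmd(
    model_path: str,
    depths: Sequence[int],
    prompt_tokens: int = 1024,
    gen_tokens: int = 128,
    gpu_layers: int = 99,
    binary: str = "./llama-bench",
) -> List[str]:
    """Assemble a llama-bench argument list using only the flags shown above."""
    return [
        binary,
        "-m", model_path,
        "-fa", "1",                               # Flash Attention on
        "-d", ",".join(str(d) for d in depths),   # prefilled context depths
        "-p", str(prompt_tokens),                 # prompt tokens to process
        "-n", str(gen_tokens),                    # tokens to generate
        "-ngl", str(gpu_layers),                  # layers offloaded to the GPU
    ]


# Hypothetical, shortened model path purely for illustration.
cmd = build_llama_bench_cmd("Qwen3-8B-128K-UD-Q4_K_XL.gguf", [4098, 8196, 16384])
print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run` as-is, which avoids shell quoting issues with long model paths.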

Benchmark Results: Prompt Processing Speed

Prompt processing speed is critical for Retrieval-Augmented Generation (RAG) and for handling long conversational histories. This is where the 5090 truly shines. The table below shows the tokens per second (t/s) when processing a new 2048-token prompt against existing context.

Model	Ctx 4K (t/s)	Ctx 8K (t/s)	Ctx 32K (t/s)	Ctx 65K (t/s)
Qwen3 8B	10406.47	8745.32	3687.76	2211.55
Qwen3 14B	6497.55	5593.80	2908.42	1707.24
Qwen3moe 30B.A3B	6630.42	5799.28	2877.53	1512.46
Qwen3 32B	2931.32	2530.12	1451.08	–
gpt-oss 20B	9443.81	8379.13	5183.08	3019.74

The performance here is outstanding. I saw the Qwen3 8B model hit over 10,400 tokens/second on prefill, demonstrating the massive throughput capability of the 5090. Even with the dense 32B model loaded (18.64 GiB VRAM footprint), I still measured nearly 3,000 t/s at the 4K context mark. The same model still processes prompts at over 1,450 t/s when the context window is prefilled to 32,768 tokens.
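To see how much prompt processing speed each model keeps as the context deepens, the table values can be turned into a simple retention ratio. This is a small sketch using only the 4K and 32K columns from the table above; the function name `retention` is just an illustrative choice.

```python
# Prompt processing throughput (t/s) copied from the table above.
pp_tps = {
    "Qwen3 8B":         {"4K": 10406.47, "32K": 3687.76},
    "Qwen3 14B":        {"4K": 6497.55,  "32K": 2908.42},
    "Qwen3moe 30B.A3B": {"4K": 6630.42,  "32K": 2877.53},
    "Qwen3 32B":        {"4K": 2931.32,  "32K": 1451.08},
    "gpt-oss 20B":      {"4K": 9443.81,  "32K": 5183.08},
}


def retention(row: dict, base: str = "4K", deep: str = "32K") -> float:
    """Fraction of the shallow-context throughput still available at depth."""
    return row[deep] / row[base]


for model, row in pp_tps.items():
    print(f"{model:18s} keeps {retention(row):.0%} of its 4K speed at 32K")

best = max(pp_tps, key=lambda m: retention(pp_tps[m]))
print("Best retention:", best)
```

Read this way, the smallest model is the fastest in absolute terms but loses the largest share of its speed at depth, while gpt-oss 20B holds on to the most.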

Benchmark Results: Token Generation Throughput

Token generation is what determines real-time chat responsiveness. This table shows the sustained inference speed measured in tokens per second (t/s) while generating 128 tokens.

Model	Size (GiB)	Ctx 4K (t/s)	Ctx 8K (t/s)	Ctx 32K (t/s)
Qwen3 8B	4.78	185.91	169.78	111.91
Qwen3 14B	8.53	123.79	115.45	82.35
Qwen3moe 30B.A3B	16.47	234.30	170.49	110.65
Qwen3 32B	18.64	61.38	55.53	43.82
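Tokens per second can be easier to judge when converted into per-token latency and the wait for a full reply. The sketch below does that for the 4K column of the table above; the 500-token reply length is a hypothetical value chosen only for illustration.

```python
# Token generation throughput (t/s) at 4K context, copied from the table above.
tg_4k = {
    "Qwen3 8B": 185.91,
    "Qwen3 14B": 123.79,
    "Qwen3moe 30B.A3B": 234.30,
    "Qwen3 32B": 61.38,
}

REPLY_TOKENS = 500  # hypothetical reply length, not taken from the benchmarks


def ms_per_token(tps: float) -> float:
    """Average time between generated tokens, in milliseconds."""
    return 1000.0 / tps


def reply_seconds(tps: float, reply_tokens: int) -> float:
    """Time to stream a reply of the given length at a constant rate."""
    return reply_tokens / tps


for model, tps in tg_4k.items():
    print(
        f"{model:18s} {ms_per_token(tps):5.1f} ms/token, "
        f"{reply_seconds(tps, REPLY_TOKENS):4.1f} s for {REPLY_TOKENS} tokens"
    )
```

This assumes the generation rate stays constant over the whole reply, which is only approximately true because throughput drops as the context grows.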

Large Context Testing: 147K Tokens

One of the most impressive aspects of the RTX 5090's 32 GB VRAM is its ability to sustain massive context windows without falling back to system RAM or disk offloading, both of which severely impact performance. I was able to push the Qwen3moe 30B model up to 147,000 tokens out of its 262K maximum context, entirely within VRAM.

Extreme Context Testing: 147K Tokens
Model	VRAM Used (GB)	Context	PP 2048 (t/s)	TG 128 (t/s)
Qwen3 8B	23	131K	948.40	49.44
Qwen3 14B	31	131K	908.39	37.20
Qwen3moe 30B.A3B	31	147K	666.24	52.28
Qwen3 32B	31	45K	666.24	52.28
gpt-oss 20B	15	131K	1636.40	112.01

At 147K tokens, VRAM usage peaked at 31 GB, fully contained on a single RTX 5090. Even at this extreme context length, inference remained stable at ~52 tokens per second.
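A back-of-the-envelope KV-cache estimate is one way to sanity-check the VRAM column. The sketch below assumes an fp16 KV cache and a Qwen3 8B layout of 36 layers, 8 KV heads, and a head dimension of 128. None of these figures come from the benchmark run, so treat them as assumptions to verify against the model's own configuration.

```python
GIB = 2**30


def kv_cache_gib(
    n_tokens: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # assumption: fp16 keys and values
) -> float:
    """Approximate KV-cache size: keys + values for every layer and token."""
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return n_tokens * bytes_per_token / GIB


# Assumed Qwen3 8B layout (not stated in the article).
kv = kv_cache_gib(n_tokens=131072, n_layers=36, n_kv_heads=8, head_dim=128)
weights_gib = 4.78  # model size from the token generation table

print(f"KV cache: {kv:.1f} GiB")
print(f"Weights + KV cache: {weights_gib + kv:.1f} GiB")
```

Under these assumptions the total lands close to the 23 GB reported for Qwen3 8B at 131K. Compute buffers and other runtime overhead are not modeled here.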

I was able to push three out of five test models – Qwen3 8B, Qwen3 14B, and gpt-oss 20B – to their maximum context limits. The Qwen3 32B dense variant managed a still-impressive 45K tokens, filling nearly all available VRAM.

The standout performer, however, was gpt-oss 20B: it handled 131K tokens at a blistering 1,600 tokens per second prompt processing speed, completing the full context load in roughly 80 seconds. Inference throughput was equally impressive at 112 tokens per second – exceptionally fast for this scale.
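The "roughly 80 seconds" figure follows directly from the gpt-oss row of the table. This short sketch reproduces the arithmetic, assuming the whole context is ingested at the deep-context rate; that is a simplification, since prompt processing is faster at shallower depths.

```python
def prefill_seconds(context_tokens: int, pp_tokens_per_s: float) -> float:
    """Time to ingest a context at a constant prompt processing rate."""
    return context_tokens / pp_tokens_per_s


# Values from the gpt-oss row of the large-context table.
load_time = prefill_seconds(context_tokens=131072, pp_tokens_per_s=1636.40)
print(f"Estimated full-context load: {load_time:.0f} s")
```

Because the constant-rate assumption uses the slowest measured speed, the real load time should be no worse than this estimate.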

MoE Performance

For hardware enthusiasts focused on performance-per-watt, the comparison between the dense 32B model and the sparse 30B MoE model is key.

Despite having similar total parameter counts, the 30B MoE model is significantly faster in token generation, achieving 234 t/s at short context lengths. This speed even surpasses the dense 8B model's 185 t/s throughput. The MoE architecture only activates a fraction of its weights per token, so far fewer weight bytes have to be read from VRAM for each generated token, and that pairs exceptionally well with the 5090's high memory bandwidth.

Running the full dense Qwen3 32B model resulted in 61 t/s at 4k context. For a model of this complexity and size running on a single consumer card, this is a fantastic result, proving that the 32 GB VRAM is sufficient to handle large, single-instance models without resorting to a multi-GPU configuration.
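The gap between the dense and MoE results can be framed with a crude bandwidth ceiling: if every generated token requires reading the active weights once, throughput cannot exceed bandwidth divided by those bytes. The sketch below is an estimate under stated assumptions, not a measurement. The 1,792 GB/s figure is the vendor-quoted peak bandwidth and was not measured in these tests, and the 3/30 active fraction for the MoE model is a rough reading of the "A3B" name that ignores shared (non-expert) weights and KV-cache reads.

```python
GIB = 2**30

# Assumption: vendor-quoted peak memory bandwidth, not measured here.
ASSUMED_BANDWIDTH_BYTES_PER_S = 1792e9


def tg_ceiling(
    weights_gib: float,
    active_fraction: float = 1.0,
    bandwidth: float = ASSUMED_BANDWIDTH_BYTES_PER_S,
) -> float:
    """Upper bound on tokens/s if each token reads the active weights once."""
    bytes_per_token = weights_gib * GIB * active_fraction
    return bandwidth / bytes_per_token


# Sizes and measured 4K throughput from the token generation table.
models = {
    "Qwen3 8B":         (4.78, 1.0, 185.91),
    "Qwen3 32B":        (18.64, 1.0, 61.38),
    "Qwen3moe 30B.A3B": (16.47, 3 / 30, 234.30),  # assumed active fraction
}

for name, (size_gib, frac, measured) in models.items():
    ceiling = tg_ceiling(size_gib, frac)
    print(f"{name:18s} ceiling ~{ceiling:6.0f} t/s, measured {measured:6.2f} t/s")
```

Under these assumptions the dense 32B model runs fairly close to its ceiling, while the MoE model has a much higher ceiling than it reaches, suggesting that at such small per-token reads other overheads start to dominate.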

RTX 5090 Pricing

While local LLM enthusiasts prioritize performance-per-dollar, the initial cost of entry for the 32 GB RTX 5090 remains substantial, placing it well above the hypothetical $2000 MSRP range. Current pricing across various AIB partners starts near $2500 and extends upwards, reflecting high demand for this level of VRAM and memory bandwidth. It appears this pricing tier is stable and unlikely to decrease significantly anytime soon, making it a premium investment; however, since hardware pricing is dynamic, it is always wise to check real-time costs before making a purchasing decision.

Model	VRAM (GB)	Card Length (mm)	Current Price
MSI GAMING TRIO OC	32	359	$2499.99
PNY EPIC-X RGB OC	32	329	$2499.99
Gigabyte AORUS MASTER	32	360	$2599.99
Zotac GAMING SOLID OC	32	330	$2695.95


Summary

The RTX 5090 32 GB is a powerful tool for the local LLM user looking to simplify their setup while targeting formerly inaccessible performance tiers. Its high VRAM capacity and exceptional memory bandwidth allow it to efficiently run large, quantized models (up to 32B Q4_K) and sustain extreme context windows well over 100K tokens. If your use case demands single-card simplicity coupled with maximum context depth, the 5090 offers a compelling value proposition in raw LLM inference power.

URL: https://www.hardware-corner.net/rtx-5090-llm-benchmarks/

RTX 5090 LLM Benchmark Results: 10K Tokens/sec Prompt Processing, 139K Context