URL: https://www.hardware-corner.net/rtx-4090-llm-benchmarks/


RTX 4090 LLM Benchmarks: Performance Across 4K – 131K Context Sizes

Updated: November 6, 2025

πŸ‘ NVIDIA GeForce RTX 4090 graphics card with performance benchmark graph background, illustrating powerful GPU performance for local LLM and AI model inference.

I tested the RTX 4090 with four quantized models to measure real-world inference performance for local LLM workloads. This is the second article in my GPU benchmark series, following my recent RTX 5090 tests. I ran these benchmarks to provide concrete performance data across different model sizes and context lengths using llama.cpp.

Testing Environment

My test system runs on an AMD EPYC 7642 48-core processor with 129GB of system memory. The setup uses Ubuntu 24.04.2 LTS with PyTorch 2.8.3 on CUDA 12.8. I performed all tests using llama-bench from llama.cpp, build 1bb4f433.

RTX 4090 Specifications

The RTX 4090 remains a strong option for local LLM inference with its 24GB VRAM capacity and high memory bandwidth.

| Specification | Value |
|---|---|
| Architecture | Ada Lovelace |
| Memory Size | 24 GB |
| Memory Type | GDDR6X |
| Memory Clock | 1313 MHz (21 Gbps effective) |
| Memory Bus | 384 bit |
| Bandwidth | 1.01 TB/s |
| Shading Units | 16384 |
| Tensor Cores | 512 |
| TDP | 450 W |
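
The bandwidth figure follows from the effective data rate and the bus width listed above. A quick arithmetic check, using only those two values (the variable names are mine):

```python
# Peak memory bandwidth = per-pin data rate (Gbit/s) * bus width (bits) / 8 bits per byte.
data_rate_gbps = 21     # effective GDDR6X data rate from the spec table
bus_width_bits = 384    # memory bus width from the spec table

bandwidth_gb_s = data_rate_gbps * bus_width_bits / 8
print(f"{bandwidth_gb_s:.0f} GB/s (~{bandwidth_gb_s / 1000:.2f} TB/s)")
# prints: 1008 GB/s (~1.01 TB/s)
```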

Software Stack

I used llama.cpp for all benchmarks because it provides consistent performance across different model architectures and supports efficient CUDA acceleration. The testing methodology involves running each model with full GPU offloading (ngl 99) and flash attention enabled (fa 1). I measured both prompt processing and token generation speeds across multiple context lengths to show how performance scales with working memory requirements.

Benchmark Methodology

I tested with llama-bench with flash attention enabled. Each run first prefills the context with a set number of tokens using the -d parameter, then processes additional tokens to measure prompt processing and token generation performance at that depth. For prompt processing tests, I prefilled the context and passed an additional 2048 tokens. For token generation tests, I prefilled the context and generated 128 new tokens.

Command line used:

./llama-bench -m /home/allanw/model-name -fa 1 -d 4096,8192,32768,45062,131072 -p 2048 -n 128 -ngl 99
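
For repeated runs, the same command can be assembled in a script. The sketch below is only an illustration: `build_llama_bench_cmd` is a hypothetical helper of mine, the depth list is an illustrative subset, and the flags are the ones from the command line above. It prints the command rather than executing it.

```python
import shlex

def build_llama_bench_cmd(model_path, depths, n_prompt=2048, n_gen=128):
    """Hypothetical helper: build the llama-bench argument list shown in the article."""
    return [
        "./llama-bench",
        "-m", model_path,
        "-fa", "1",                               # flash attention enabled
        "-d", ",".join(str(d) for d in depths),   # prefill depths
        "-p", str(n_prompt),                      # tokens for the prompt processing test
        "-n", str(n_gen),                         # tokens for the generation test
        "-ngl", "99",                             # full GPU offload
    ]

depths = [4096, 8192, 32768]  # illustrative subset, not the full list tested
cmd = build_llama_bench_cmd("/home/allanw/model-name", depths)
print(shlex.join(cmd))

# Context occupied at the end of each test: prefill depth plus the tokens processed.
for d in depths:
    print(f"depth {d}: prompt test ends at {d + 2048}, generation test ends at {d + 128}")
```

To actually launch the benchmark, the list could be passed to `subprocess.run(cmd, check=True)`, assuming the binary and model exist at those paths.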

Prompt Processing Performance

Prompt processing speed determines how quickly the model can ingest and analyze input text. Higher tokens-per-second values here mean faster response times when working with large documents or long conversation histories.

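Prompt processing speed in tokens/s by context depth (– = not run, exceeds VRAM):

| Model | Size (GiB) | 4K | 8K | 16K | 32K | 45K | 57K | 65K | 86K | 131K |
|---|---|---|---|---|---|---|---|---|---|---|
| 8B Q4_K_XL | 4.79 | 9121 | 7907 | 5697 | 4224 | 3410 | 2967 | 2614 | 1951 | 1451 |
| 14B Q4_K_XL | 8.53 | 5358 | 4621 | 3355 | 2511 | 2124 | 1667 | 1545 | 1190 | – |
| 30B MoE Q4_K_XL | 16.47 | 6266 | 5415 | 3802 | 2726 | 2290 | 1718 | – | – | – |
| 32B Q4_K_XL | 18.64 | 2393 | 2093 | 1685 | – | – | – | – | – | – |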

The RTX 4090 handles prompt processing very efficiently. On the 8B model, I measured 9121 tokens/s at 4K context, and the 14B model still manages 5358 tokens/s. Smaller models like these are extremely responsive for short prompts. As model size increases, the 4090’s 24 GB of VRAM matters more: larger models leave less room for context, and throughput drops off clearly, down to 2393 tokens/s for the 32B model at 4K.

For me, the 30B MoE model hits the sweet spot. It can handle a full 57K token context, which is practical for general chat, content creation, or coding. Prompt processing runs at 1718 tokens/s at that depth, so the GPU can load the entire 57K sequence in roughly 34 seconds, and likely less, since the shallower parts of the context are processed faster. The 32B model is still usable, but at 16K context we’re near the VRAM limit. It processes prompts at 1685 tokens/s, but the restricted context size makes it less flexible than the 30B MoE on this GPU.
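
The load-time estimate is simply tokens divided by throughput. A minimal sketch, assuming "57K" means 57 × 1024 tokens and "16K" means 16 × 1024 tokens, and using the rate measured at the deepest point as a pessimistic constant (`ingest_seconds` is a name I made up):

```python
def ingest_seconds(n_tokens, tokens_per_second):
    """Time to process n_tokens at a constant prompt processing rate."""
    return n_tokens / tokens_per_second

# Assumption: "K" is 1024 tokens; rates are the table values at the deepest tested context.
moe_seconds = ingest_seconds(57 * 1024, 1718)    # 30B MoE at 57K
dense_seconds = ingest_seconds(16 * 1024, 1685)  # 32B at 16K

print(f"30B MoE, 57K context: ~{moe_seconds:.0f} s")
print(f"32B, 16K context:     ~{dense_seconds:.1f} s")
```

Because throughput is higher at shallower depths, real ingestion of a full prompt should finish somewhat faster than this constant-rate estimate.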

Token Generation Performance

Token generation speed measures output performance during inference. This metric directly impacts how fast the model produces responses during actual use.

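Token generation speed in tokens/s by context depth (– = not run, exceeds VRAM):

| Model | Size (GiB) | 4K | 8K | 16K | 32K | 45K | 57K | 65K | 86K | 131K |
|---|---|---|---|---|---|---|---|---|---|---|
| 8B Q4_K_XL | 4.79 | 131.00 | 119.36 | 96.12 | 77.42 | 66.18 | 62.81 | 53.07 | 43.01 | 32.27 |
| 14B Q4_K_XL | 8.53 | 82.82 | 77.46 | 63.21 | 55.49 | 48.11 | 42.99 | 38.73 | 32.87 | – |
| 30B MoE Q4_K_XL | 16.47 | 195.78 | 171.94 | 130.11 | 102.57 | 92.12 | 74.64 | – | – | – |
| 32B Q4_K_XL | 18.64 | 38.89 | 36.94 | 33.83 | – | – | – | – | – | – |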

The MoE model stands out as the optimal choice for the RTX 4090, delivering exceptional performance across its usable context range. At 4K context, it reaches 195.78 tokens per second, and maintains 74.64 tokens per second even at 57K context. The mixture of experts architecture activates only a subset of parameters per token, providing great model capability while staying well within the 24GB VRAM envelope. For users wanting maximum performance without compromising on model quality, the 30B MoE represents the sweet spot on this hardware.

Performance Scaling with Context Length

Context length significantly impacts both prompt processing and token generation performance. The performance degradation follows a predictable pattern as context increases. For the 8B model, token generation drops from 131 tokens per second at 4K to 53.07 tokens per second at 65K context, representing a 59% reduction. The 14B model shows similar scaling, dropping from 82.82 to 38.73 tokens per second between 4K and 65K context.
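
The reduction percentages can be reproduced from the token generation table. A small sketch using the 4K and 65K values for the two models mentioned (`pct_drop` is a hypothetical helper name):

```python
def pct_drop(start, end):
    """Relative reduction from start to end, in percent."""
    return (start - end) / start * 100

# (tokens/s at 4K, tokens/s at 65K) from the token generation table
generation = {
    "8B Q4_K_XL": (131.00, 53.07),
    "14B Q4_K_XL": (82.82, 38.73),
}

for model, (at_4k, at_65k) in generation.items():
    print(f"{model}: {pct_drop(at_4k, at_65k):.1f}% slower at 65K than at 4K")
```

This gives about 59% for the 8B model and about 53% for the 14B model.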

The RTX 4090 demonstrates strong throughput across all tested models and context lengths, handling everything from the 8B model at extreme context lengths to the 32B model at 16K context. However, the 24GB VRAM ceiling creates practical limitations that position this card awkwardly in the current market. The VRAM constraint forces compromises on larger models, capping the 32B at 16K context where many real-world applications need more. This places the RTX 4090 between the RTX 3090, which delivers comparable performance at substantially lower second-hand prices, and the RTX 5090 with its 32GB VRAM that removes these context limitations entirely for models in this parameter range.

Model Size Considerations for 24GB VRAM

The 24GB of VRAM on the RTX 4090 largely determines which models and context lengths I can run efficiently. The 8B model at 4.79 GiB leaves plenty of headroom, allowing very long contexts without any memory concerns. The 14B model at 8.53 GiB also fits comfortably, supporting extended context windows with predictable scaling.

When moving to larger models, VRAM becomes the limiting factor. The 30B MoE model at 16.47 GiB is where the card hits its sweet spot. It can handle up to 57K tokens with prompt processing at 1718 tokens/s, which is practical for chat, content creation, or coding workflows. Beyond that, the 32B model at 18.64 GiB starts pushing the VRAM limit. I can still run it at 16K context, processing prompts at 1685 tokens/s, but higher contexts aren’t practical without running into memory constraints.
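
A rough way to see why the larger models run out of context is to subtract the model file size from the card's memory. The sketch below uses the GiB sizes from the benchmark tables and makes two simplifying assumptions: the card's 24 GB is treated as 24 GiB, and runtime overhead (CUDA context, compute buffers) is ignored, so the real headroom is smaller.

```python
VRAM_GIB = 24.0  # assumption: the 24 GB card is treated as 24 GiB, overhead ignored

model_size_gib = {
    "8B Q4_K_XL": 4.79,
    "14B Q4_K_XL": 8.53,
    "30B MoE Q4_K_XL": 16.47,
    "32B Q4_K_XL": 18.64,
}

# Memory left over for the KV cache and compute buffers after loading the weights.
headroom_gib = {name: round(VRAM_GIB - size, 2) for name, size in model_size_gib.items()}

for name, free in headroom_gib.items():
    print(f"{name:<16} {free:5.2f} GiB left for KV cache and buffers")
```

How many tokens fit into that headroom depends on each model's layer count and KV head configuration, which this sketch does not attempt to model.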

For anyone planning a local LLM setup, the RTX 4090 can handle models up to roughly 32B parameters in Q4_K quantization. Anything larger, like 70B models, will require either more aggressive quantization or a multi-GPU setup to stay usable.

RTX 4090 Pricing

Although the RTX 4090 remains one of the most capable GPUs for both gaming and AI workloads, its value proposition for local LLM enthusiasts has diminished with the arrival of the RTX 5090. While the 4090 originally launched with an MSRP of around $1599, real-world retail pricing continues to hover near the $2500 range, comparable to some RTX 5090 models despite the latter offering 32 GB of VRAM and significantly higher memory bandwidth (1790 GB/s vs. 1010 GB/s).

This makes the 4090 a less attractive option for heavy AI or LLM workloads, where VRAM capacity and bandwidth are key. However, the 4090 remains a powerhouse for gaming and mixed-use workloads. On the second-hand market, prices have stabilized closer to $2100, offering a more reasonable cost of entry than current retail listings.

RTX 4090 Pricing

| Model | VRAM (GB) | Card Length (mm) | Current Price |
|---|---|---|---|
| Zotac GAMING AMP Extreme AIRO | 24 | 356 | $2574.94 |
| Asus ROG STRIX GAMING | 24 | 358 | $2800.00 |
| Asus ROG STRIX BTF GAMING OC | 24 | 358 | $2899.99 |
| Gigabyte GAMING OC | 24 | 340 | $2929.95 |

*Affiliate Disclaimer: Some links provided above are affiliate links. If you make a purchase through these links, I may earn a small commission, which helps support the continuation of this hardware testing.

Conclusion

The RTX 4090 delivers strong performance for local LLM inference across models from 8B to 32B parameters. The 8B and 14B models run efficiently with long contexts, while the MoE model shows excellent generation speeds. The 32B model fits but operates at the VRAM limit. For most local LLM workloads under 32B parameters, the RTX 4090 provides sufficient VRAM and bandwidth.
