RTX 4090 LLM Benchmarks: Performance Across 4K–131K Context Sizes
Updated: November 6, 2025
I tested the RTX 4090 with four quantized models to measure real-world inference performance for local LLM workloads. This is the second article in my GPU benchmark series, following my recent RTX 5090 tests. I ran these benchmarks to provide concrete performance data across different model sizes and context lengths using llama.cpp.
Testing Environment
My test system runs on an AMD EPYC 7642 48-core processor with 129GB of system memory. The setup uses Ubuntu 24.04.2 LTS with PyTorch 2.8.3 on CUDA 12.8. I performed all tests using llama-bench from llama.cpp, build 1bb4f433.
RTX 4090 Specifications
The RTX 4090 remains a strong option for local LLM inference with its 24GB VRAM capacity and high memory bandwidth.
| Specification | Value |
|---|---|
| Architecture | Ada Lovelace |
| Memory Size | 24 GB |
| Memory Type | GDDR6X |
| Memory Clock | 1313 MHz (21 Gbps effective) |
| Memory Bus | 384 bit |
| Bandwidth | 1.01 TB/s |
| Shading Units | 16384 |
| Tensor Cores | 512 |
| TDP | 450 W |
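The bandwidth figure in the table follows directly from the bus width and the effective per-pin data rate. A minimal sketch of that arithmetic (the function name is illustrative and not part of any library):

```python
def memory_bandwidth_gb_s(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak memory bandwidth in GB/s: bus pins * per-pin rate (Gbit/s) / 8 bits per byte."""
    return bus_width_bits * data_rate_gbps / 8


# RTX 4090 values taken from the specification table
bandwidth = memory_bandwidth_gb_s(384, 21)
print(f"{bandwidth:.0f} GB/s")  # 1008 GB/s, i.e. about 1.01 TB/s
```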
Software Stack
I used llama.cpp for all benchmarks because it provides consistent performance across different model architectures and supports efficient CUDA acceleration. Each model runs with full GPU offloading (-ngl 99) and flash attention enabled (-fa 1). I measured both prompt processing and token generation speeds across multiple context lengths to show how performance scales with working memory requirements.
Benchmark Methodology
I ran every test with llama-bench with flash attention enabled. The -d parameter prefills the context to a given depth before each measurement, so the results reflect performance with that many tokens already in the KV cache. For prompt processing tests, I prefilled the context and then processed an additional 2048 tokens (-p 2048). For token generation tests, I prefilled the context and generated 128 new tokens (-n 128).
Command line used:
./llama-bench -m /home/allanw/model-name -fa 1 -d 4098,8196,32768,45062,139247 -p 2048 -n 128 -ngl 99
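To repeat the run across several models, the same invocation can be assembled programmatically. The helper below is only a sketch: `build_llama_bench_cmd`, the model path, and the example depth values are placeholders of mine, and it uses only the flags that already appear in the command above.

```python
from typing import List, Sequence


def build_llama_bench_cmd(model_path: str, depths: Sequence[int],
                          n_prompt: int = 2048, n_gen: int = 128,
                          binary: str = "./llama-bench") -> List[str]:
    """Assemble a llama-bench argument list with the flags used in this article."""
    return [
        binary,
        "-m", model_path,
        "-fa", "1",                              # flash attention enabled
        "-d", ",".join(str(d) for d in depths),  # context depths to prefill
        "-p", str(n_prompt),                     # prompt tokens processed after prefill
        "-n", str(n_gen),                        # tokens generated after prefill
        "-ngl", "99",                            # offload all layers to the GPU
    ]


# Placeholder path and illustrative depths, not the exact values from the run above
cmd = build_llama_bench_cmd("/path/to/model.gguf", [4096, 8192, 16384])
print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` to execute the benchmark.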
Prompt Processing Performance
Prompt processing speed determines how quickly the model can ingest and analyze input text. Higher tokens-per-second values here mean faster response times when working with large documents or long conversation histories.
| Model | Size (GiB) | 4K | 8K | 16K | 32K | 45K | 57K | 65K | 86K | 131K |
|---|---|---|---|---|---|---|---|---|---|---|
| 8B Q4_K_XL | 4.79 | 9121 | 7907 | 5697 | 4224 | 3410 | 2967 | 2614 | 1951 | 1451 |
| 14B Q4_K_XL | 8.53 | 5358 | 4621 | 3355 | 2511 | 2124 | 1667 | 1545 | 1190 | – |
| 30B MoE Q4_K_XL | 16.47 | 6266 | 5415 | 3802 | 2726 | 2290 | 1718 | – | – | – |
| 32B Q4_K_XL | 18.64 | 2393 | 2093 | 1685 | – | – | – | – | – | – |
The RTX 4090 handles prompt processing very efficiently. On the 8B model, I measured 9121 tokens/s at 4K context, and the 14B model still manages 5358 tokens/s. Smaller models like these are extremely responsive for short prompts. As model size increases, throughput drops and the 4090's 24 GB of VRAM becomes the constraint on how much context fits.
For me, the 30B MoE model hits the sweet spot. It can handle a full 57K token context, which is practical for general chat, content creation, or coding. Prompt processing runs at 1718 tokens/s at that depth, so the GPU can load the entire 57K sequence in at most about 33 seconds, and less in practice because the shallower part of the context processes faster. The 32B model is still usable, but at 16K context we're near the VRAM limit. It processes prompts at 1685 tokens/s, but the restricted context size makes it less flexible than the 30B MoE on this GPU.
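As a rough check on ingest time, dividing the context size by the throughput measured at that depth gives an upper bound, since the table shows throughput only falling as depth grows. This is a sketch under that assumption; the function name is mine, and 57,000 is an assumed token count for the "57K" column.

```python
def max_prefill_seconds(context_tokens: int, tokens_per_second: float) -> float:
    """Upper bound on prefill time, using the slowest rate (measured at full depth)."""
    return context_tokens / tokens_per_second


# 30B MoE: 1718 tokens/s at 57K depth (from the prompt processing table)
moe_seconds = max_prefill_seconds(57_000, 1718)
# 32B: 1685 tokens/s at 16K depth (16_384 tokens assumed for "16K")
dense_seconds = max_prefill_seconds(16_384, 1685)

print(f"30B MoE, 57K context: at most {moe_seconds:.1f} s")
print(f"32B, 16K context:     at most {dense_seconds:.1f} s")
```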
Token Generation Performance
Token generation speed measures output performance during inference. This metric directly impacts how fast the model produces responses during actual use.
| Model | Size (GiB) | 4K | 8K | 16K | 32K | 45K | 57K | 65K | 86K | 131K |
|---|---|---|---|---|---|---|---|---|---|---|
| 8B Q4_K_XL | 4.79 | 131.00 | 119.36 | 96.12 | 77.42 | 66.18 | 62.81 | 53.07 | 43.01 | 32.27 |
| 14B Q4_K_XL | 8.53 | 82.82 | 77.46 | 63.21 | 55.49 | 48.11 | 42.99 | 38.73 | 32.87 | – |
| 30B MoE Q4_K_XL | 16.47 | 195.78 | 171.94 | 130.11 | 102.57 | 92.12 | 74.64 | – | – | – |
| 32B Q4_K_XL | 18.64 | 38.89 | 36.94 | 33.83 | – | – | – | – | – | – |
The MoE model stands out as the optimal choice for the RTX 4090, delivering exceptional performance across its usable context range. At 4K context, it reaches 195.78 tokens per second, and maintains 74.64 tokens per second even at 57K context. The mixture of experts architecture activates only a subset of parameters per token, providing great model capability while staying well within the 24GB VRAM envelope. For users wanting maximum performance without compromising on model quality, the 30B MoE represents the sweet spot on this hardware.
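One way to see the MoE effect in these numbers is a simple bandwidth-bound estimate: if every weight byte had to be read once per generated token, generation speed could not exceed memory bandwidth divided by model size. The sketch below applies that rule of thumb to the measured 4K results. It is an approximation of mine, not part of the benchmark: it ignores KV cache reads and compute, and it uses the 1008 GB/s figure implied by a 384-bit bus at 21 Gbps.

```python
GIB = 1024 ** 3
BANDWIDTH_BYTES_S = 1008e9  # 384-bit bus * 21 Gbps / 8


def dense_decode_ceiling(model_size_gib: float) -> float:
    """Tokens/s upper bound if all weights are read once per generated token."""
    return BANDWIDTH_BYTES_S / (model_size_gib * GIB)


# (file size in GiB, measured tokens/s at 4K depth) from the tables in this article
models = {
    "8B": (4.79, 131.00),
    "14B": (8.53, 82.82),
    "30B MoE": (16.47, 195.78),
    "32B": (18.64, 38.89),
}

for name, (size_gib, measured) in models.items():
    ceiling = dense_decode_ceiling(size_gib)
    print(f"{name:8s} dense ceiling {ceiling:6.1f} t/s, measured {measured:6.2f} t/s")
```

The three dense models land below their ceilings, while the 30B MoE result is several times higher than a dense model of the same file size could reach. That is consistent with only a subset of the weights being read for each token.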
Performance Scaling with Context Length
Context length significantly impacts both prompt processing and token generation performance. The performance degradation follows a predictable pattern as context increases. For the 8B model, token generation drops from 131 tokens per second at 4K to 53.07 tokens per second at 65K context, representing a 59% reduction. The 14B model shows similar scaling, dropping from 82.82 to 38.73 tokens per second between 4K and 65K context.
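The reduction figures can be reproduced from the token generation table with a one-line calculation. A minimal sketch (the helper name is illustrative):

```python
def percent_reduction(start: float, end: float) -> float:
    """Relative drop from start to end, as a percentage of start."""
    return (start - end) / start * 100


# Token generation at 4K vs. 65K depth, from the table above
drop_8b = percent_reduction(131.00, 53.07)
drop_14b = percent_reduction(82.82, 38.73)

print(f"8B:  {drop_8b:.1f}% slower at 65K than at 4K")
print(f"14B: {drop_14b:.1f}% slower at 65K than at 4K")
```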
The RTX 4090 demonstrates strong throughput across all tested models and context lengths, handling everything from the 8B model at extreme context lengths to the 32B model at 16K context. However, the 24GB VRAM ceiling creates practical limitations that position this card awkwardly in the current market. The VRAM constraint forces compromises on larger models, capping the 32B at 16K context where many real-world applications need more. This places the RTX 4090 between the RTX 3090, which delivers comparable performance at substantially lower second-hand prices, and the RTX 5090 with its 32GB VRAM that removes these context limitations entirely for models in this parameter range.
Model Size Considerations for 24GB VRAM
The 24GB of VRAM on the RTX 4090 largely determines which models and context lengths I can run efficiently. The 8B model at 4.79 GiB leaves plenty of headroom, allowing very long contexts without running out of memory. The 14B model at 8.53 GiB also fits comfortably, supporting extended context windows with predictable scaling.
When moving to larger models, VRAM becomes the limiting factor. The 30B MoE model at 16.47 GiB is where the card hits its sweet spot. It can handle up to 57K tokens with prompt processing at 1718 tokens/s, which is practical for chat, content creation, or coding workflows. Beyond that, the 32B model at 18.64 GiB starts pushing the VRAM limit. I can still run it at 16K context, processing prompts at 1685 tokens/s, but higher contexts aren't practical without running into memory constraints.
For anyone planning a local LLM setup, the RTX 4090 can handle models up to roughly 32B parameters in Q4_K quantization. Anything larger, like 70B models, will require either more aggressive quantization or a multi-GPU setup to stay usable.
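The context limits above come down to weights plus KV cache having to fit in VRAM. The sketch below estimates that budget for the 32B model. The architecture numbers are hypothetical values I chose for a 32B-class model with grouped-query attention (64 layers, 8 KV heads, head dimension 128, FP16 cache); they are not taken from the benchmarked model, and the estimate ignores compute buffers and other runtime overhead.

```python
GIB = 1024 ** 3


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_element: int = 2) -> float:
    """Size of the key and value caches in GiB for a given context length."""
    # Factor 2 covers the separate K and V tensors in every layer
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_element
    return per_token * context_tokens / GIB


WEIGHTS_GIB = 18.64  # 32B Q4_K_XL size from the tables in this article
VRAM_GIB = 24.0      # card capacity, treated as GiB for simplicity

for ctx in (16_384, 32_768):
    # HYPOTHETICAL architecture: 64 layers, 8 KV heads, head_dim 128, FP16 cache
    total = WEIGHTS_GIB + kv_cache_gib(64, 8, 128, ctx)
    verdict = "fits" if total < VRAM_GIB else "does not fit"
    print(f"{ctx:>6} tokens: ~{total:.2f} GiB, {verdict} in {VRAM_GIB:.0f} GiB")
```

Under these assumptions the 16K case leaves a little over 1 GiB of headroom, while 32K exceeds the card's capacity, which matches the pattern in the benchmark tables.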
RTX 4090 Pricing
Although the RTX 4090 remains one of the most capable GPUs for both gaming and AI workloads, its value proposition for local LLM enthusiasts has diminished with the arrival of the RTX 5090. While the 4090 originally launched with an MSRP of around $1599, real-world retail pricing continues to hover near the $2500 range, comparable to some RTX 5090 models, even though the 5090 offers 32 GB of VRAM and significantly higher memory bandwidth (1790 GB/s vs. 1010 GB/s).
This makes the 4090 a less attractive option for heavy AI or LLM workloads, where VRAM capacity and bandwidth are key. However, the 4090 remains a powerhouse for gaming and mixed-use workloads. On the second-hand market, prices have stabilized closer to $2100, offering a more reasonable cost of entry than current retail listings.
| RTX 4090 Model | VRAM (GB) | Card Length (mm) | Current Price |
|---|---|---|---|
| Zotac GAMING AMP Extreme AIRO | 24 | 356 | $2574.94 |
| Asus ROG STRIX GAMING | 24 | 358 | $2800.00 |
| Asus ROG STRIX BTF GAMING OC | 24 | 358 | $2899.99 |
| Gigabyte GAMING OC | 24 | 340 | $2929.95 |
Conclusion
The RTX 4090 delivers strong performance for local LLM inference across models from 8B to 32B parameters. The 8B and 14B models run efficiently with long contexts, while the MoE model shows excellent generation speeds. The 32B model fits but operates at the VRAM limit. For most local LLM workloads under 32B parameters, the RTX 4090 provides sufficient VRAM and bandwidth.
