
URL: https://www.hardware-corner.net/gpu-ranking-local-llm/



The Definitive GPU Ranking for LLMs: Token Generation & Prompt Processing Performance

By Allan Witt | Updated: December 9, 2025

At Hardware Corner, we set out to create a data-driven benchmark hierarchy for local LLM inference – focusing on the two workloads that define real-world performance: prompt processing and token generation. Using llama.cpp’s latest llama-bench on Ubuntu 24.04 with CUDA 12.8, we measured a wide range of GPUs across model sizes, context lengths, and quantization levels. The goal is simple: identify which GPUs deliver the best performance for local LLM users.

✅ Nov 18 Update: Updated prompt processing with 131k context. PP graph update.

✅ Nov 10 Update: Updated prompt processing table.

✅ Nov 9 Update: Fixed token generation error for RTX 4090; New token generation performance graph.

These benchmarks aren’t theoretical. Every number you’ll see here was captured on-hardware in our lab, using consistent configurations and repeatable testing. Whether you’re upgrading a single-GPU desktop or scaling a dual-GPU setup, this guide shows what actually matters for real inference workloads – and where your money gets you the most tokens per second.

GPU Token Generation Benchmarks

To measure real-world performance, we tested a range of consumer and professional GPUs with various open-weight models at different context lengths. All benchmarks were run on Ubuntu 24.04 using the latest version of llama.cpp’s llama-bench, with CUDA 12.8. We used 4-bit (Q4_K_XL) quantization for all of the benchmarks. The following data shows the token generation speeds we recorded.
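For readers who want to attempt a comparable run, the sketch below assembles a llama-bench command line in Python. It is not the command behind the tables in this article, which is not listed here. The model path, the 512-token prompt test, the 128-token generation test, the repetition count, and the use of -d to pre-fill the context before measuring are all assumptions chosen for illustration.

```python
import shlex

def build_llama_bench_cmd(model_path, depth, n_prompt=512, n_gen=128, repetitions=3):
    """Assemble a llama-bench argv list (hypothetical settings, not the article's)."""
    return [
        "llama-bench",
        "-m", model_path,        # GGUF model file (placeholder path)
        "-ngl", "99",            # offload all layers to the GPU
        "-d", str(depth),        # tokens already in context before measuring
        "-p", str(n_prompt),     # prompt processing test length
        "-n", str(n_gen),        # token generation test length
        "-r", str(repetitions),  # repetitions per test
        "-o", "json",            # machine-readable output
    ]

# Placeholder file name; substitute a real 4-bit GGUF file.
cmd = build_llama_bench_cmd("models/example-Q4_K_XL.gguf", depth=16384)
print(shlex.join(cmd))
```

Building the argument list separately from running it makes it easy to loop over context depths (16384, 32768, and so on) and pass each list to subprocess.run.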

Figure: Scatter plot showing GPU token generation performance vs VRAM for 8B model at 16k context. Larger circles represent higher GPU prices, comparing models like RTX 5090, RTX 4090, RTX 4080, and RTX Pro 6000

Key Findings

  • Best Budget GPU: The NVIDIA RTX 5060 Ti (16GB) offers the best price-to-performance for running capable models.
  • Best Value for VRAM: A used NVIDIA RTX 3090 (24GB) is the top choice for a single-GPU setup, enabling larger 30B+ models at an unbeatable price-to-VRAM ratio.
  • Best for Cheap Scaling: A dual RTX 5060 Ti setup is a smart, budget-friendly path to 32GB of VRAM for running 34B models.
  • Top Consumer Performance: The RTX 5090 (32GB) is the new speed leader for consumer GPUs, significantly outperforming previous generations.
  • For Massive Models: The RTX PRO 6000 (96GB) is the only viable option for running the largest 70B-120B models, where memory capacity is the main priority.

GPU Benchmark Comparison: Token Generation Speed for 8B–123B Models from 16K to 131K Context

These benchmark tables compare GPUs based on their token generation speed (tokens per second) when running large language models (LLMs) at context lengths from 16K to 131K. Each column represents a model – from Qwen3 8B to Mistral Large 123B at 4-bit quantization – and each value indicates the GPU’s generation throughput.


Table 16K Context

GPU Price Qwen3 8B Qwen3 14B Qwen3 30B Qwen3 32B gpt-oss 20B Llama 70B gpt-oss 120B GLM 4.5 Air 106B Mistral Large 123B
RTX 5090 $2,499.00 145.34 102.68 141.63 50.92 249.19
RTX Pro 6000 WS $8,000.00 140.62 96.86 139.76 45.72 237.92 28.24 182.23 82.87 17.73
RTX 4090 48GB $3,100.00 106.22 68.10 137.13 32.57 145.63 17.79
RTX 4090 $2,574.00 104.31 69.14 139.71 37.68 163.92
RTX 6000 Ada $5,000.00 98.68 58.51 120.12 25.08 137.10 13.65
RTX 5080 $999.00 94.14 64.04 140.48
RTX 3090 Ti $1,669.00 93.60 56.90 121.88 32.61 137.55
RTX 3080 Ti $1,199.00 87.94 52.29
RTX 5070 Ti $749.00 87.54 57.98 133.05
RTX 3090 $1,499.00 87.45 52.14 113.84 30.28 128.51
RTX 4080 SUPER $1,597.00 79.36 52.76 122.96
RTX 4080 $1,349.00 77.86 51.22 120.16
RTX 3080 10GB $735.00 74.24
RTX 4070 Ti SUPER $1,148.00 72.20 47.18 113.60
RTX A6000 $3,650.00 64.26 40.69 75.35 18.32 91.19
RTX 5070 $579.00 59.13 40.59
RTX 4070 Ti $917.00 57.55 37.75
RTX 4070 SUPER $759.00 56.21 37.15
RTX 4070 $579.00 52.07 32.66
RTX 5060 Ti $430.00 51.41 32.91 82.42
RTX 3060 $250.00 41.97 22.66
RTX 4060 Ti $409.00 34.31 22.36 57.76

Table 32K Context

GPU Price Qwen3 8B Qwen3 14B Qwen3 30B Qwen3 32B gpt-oss 20B Llama 70B gpt-oss 120B GLM 4.5 Air 106B Mistral Large 123B
RTX 5090 $2,499.00 111.91 82.35 110.65 43.82 215.08
RTX Pro 6000 WS $8,000.00 111.1 80.32 103.33 39.91 207.45 25.67 161.49 64.18 15.9
RTX 4090 48GB $3,100.00 81.73 57.01 95.23 25.84 126.04
RTX 4090 $2,574.00 78.42 55.49 105.57 140.34
RTX 6000 Ada $5,000.00 68.87 42.27 74.31 18.17 116.25
RTX 5080 $999.00 72.49 51.87 128.12
RTX 3090 Ti $1,499.00 71.75 91.97 119.69
RTX 3080 Ti $1,199.00 68.07
RTX 5070 Ti $749.00 63.3 45.54 115.88
RTX 3090 $1,499.00 67.88 38.64 87.21 112.55
RTX 4080 SUPER $1,597.00 59.47 42.64 107.19
RTX 4080 $1,349.00 58.98 40.56 106.02
RTX 3080 10GB $699.00 55.7
RTX 4070 Ti SUPER $1,148.00 54.5 37.72 98.66
RTX A6000 $3,650.00 44.55 30.75 48.13 13.52 74.41
RTX 5070 $579.00 43.6
RTX 4070 Ti $900.00 42.1
RTX 4070 SUPER $759.00 42.17
RTX 4070 $579.00 38.07
RTX 5060 Ti $430.00 38.93 25.85 73.21
RTX 3060 $250.00 31.86
RTX 4060 Ti $409.00 25.53 17.86 51.48

Table 65K Context

GPU Price Qwen3 8B Qwen3 14B Qwen3 30B Qwen3 32B gpt-oss 20B Llama 70B gpt-oss 120B GLM 4.5 Air 106B Mistral Large 123B
RTX 5090 $2,499.00 80.14 57.58 76.56 168.95
RTX Pro 6000 WS $8,000.00 77.94 60.36 80.15 32.11 171.33 21.67 133.07 30.54 10.52
RTX 4090 48GB $3,100.00 55.70 41.01 59.20 18.08 99.14
RTX 4090 $2,574.00 53.07 38.73 103.01
RTX 6000 Ada $5,000.00 39.89 27.01 40.83 11.76 88.40
RTX 5080 $999.00 49 98.53
RTX 3090 Ti $1,499.00 48.98 28.2 94.91
RTX 3080 Ti $1,199.00
RTX 5070 Ti $749.00 40.41 91.25
RTX 3090 $1,499.00 46.59 25.4 89.64
RTX 4080 SUPER $1,597.00 39.12 81.45
RTX 4080 $1,349.00 39.01 81.34
RTX 3080 10GB $699.00
RTX 4070 Ti SUPER $1,148.00 36.6 78.67
RTX A6000 $3,650.00 27.93 20.51 27.57 8.68 53.25
RTX 5070 $579.00
RTX 4070 Ti $900.00
RTX 4070 SUPER $759.00
RTX 4070 $579.00
RTX 5060 Ti $430.00 25.82 58.1
RTX 3060 $250.00
RTX 4060 Ti $409.00 13.01 41.11

Table 131K Context

GPU Price Qwen3 8B Qwen3 14B Qwen3 30B Qwen3 32B gpt-oss 20B Llama 70B gpt-oss 120B GLM 4.5 Air 106B Mistral Large 123B
RTX 5090 $2,499.00 49.44 37.2 52.84 112.01
RTX Pro 6000 WS $8,000.00 48.29 39.83 56.14 23.05 118.44 16.62 99.79 17.72
RTX 4090 48GB $3,100.00 33.62 24.53 33.21 67.37
RTX 4090 $2,574.00 32.27 70.27
RTX 6000 Ada $5,000.00 20.76 15.18 20.71 57.47
RTX 5080 $999.00 69.62
RTX 3090 Ti $1,499.00 29.01 66.45
RTX 3080 Ti $1,199.00
RTX 5070 Ti $749.00 64.61
RTX 3090 $1,499.00 28.09 62.18
RTX 4080 SUPER $1,597.00 60.01
RTX 4080 $1,349.00 59.95
RTX 3080 10GB $699.00
RTX 4070 Ti SUPER $1,148.00 57.47
RTX A6000 $3,650.00 15.95 12.10 14.77 33.42
RTX 5070 $579.00
RTX 4070 Ti $900.00
RTX 4070 SUPER $759.00
RTX 4070 $579.00
RTX 5060 Ti $430.00 43.77
RTX 3060 $250.00
RTX 4060 Ti $409.00 31.08

16GB GPUs performance with LLMs

The 16GB VRAM tier represents a critical sweet spot for many budget-conscious enthusiasts. It unlocks the ability to run larger and more capable models, such as 7B models with long context or 20B quantized models, without requiring a massive investment. In this competitive bracket, the price-to-performance ratio is the key metric that separates the good deals from the expensive hardware.

The NVIDIA RTX 5060 Ti 16GB stands out as the top value pick for budget builders. Priced around $430, it delivers 51.41 t/s on Qwen3 8B at 16K context, well above the RTX 4060 Ti 16GB’s 34.31 t/s at a similar cost. While premium cards like the RTX 4080 SUPER boast higher raw performance, their $1,200+ price tags cripple their performance-per-dollar ratio. The 5060 Ti strikes the sweet spot: enough VRAM, solid speed, and an accessible price.
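The price-to-performance argument can be checked with a few lines of arithmetic. The sketch below uses the list prices and Qwen3 8B speeds from the 16K table above; the "tokens per second per $100" metric is just one reasonable way to express value, not necessarily the one used for the charts in this article.

```python
# (price in USD, Qwen3 8B tokens/s at 16K context), copied from the 16K table
cards = {
    "RTX 5060 Ti": (430.00, 51.41),
    "RTX 4060 Ti": (409.00, 34.31),
    "RTX 5070 Ti": (749.00, 87.54),
    "RTX 5080": (999.00, 94.14),
    "RTX 4080 SUPER": (1597.00, 79.36),
}

def tps_per_100_usd(price, tps):
    """Generation speed bought by each $100 of purchase price."""
    return tps / price * 100

# Sort from best value to worst value.
ranking = sorted(cards, key=lambda name: tps_per_100_usd(*cards[name]), reverse=True)
for name in ranking:
    print(f"{name:16s} {tps_per_100_usd(*cards[name]):5.2f} t/s per $100")
```

With these inputs the RTX 5060 Ti comes out narrowly ahead of the RTX 5070 Ti, and the RTX 4080 SUPER lands last.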

Figure: Token generation speeds (tokens/sec) for 16 GB GPUs running 8B, 14B, and 20B LLMs at 16k context. Each bar shows performance across model sizes, highlighting how newer 50-series cards, especially the RTX 5060 Ti, deliver strong efficiency relative to older or higher-priced GPUs.

The RTX 5070 Ti 16GB offers impressive power, but at nearly double the cost ($812), its gains on a 20B model (roughly 66 t/s versus the 5060 Ti’s 43 t/s) don’t justify the jump for users limited to 16GB VRAM. It’s faster, yes, but not transformative, with benefits mainly in handling extremely long prompts.

Figure: This mirror bar chart compares each GPU’s purchase price (left) against its cost efficiency (right). While high-end models like the RTX 4080 SUPER command steep prices, midrange cards such as the RTX 5060 Ti and 5070 Ti deliver far better value per token generation speed, highlighting the sweet spot for local LLM builders.

Where the 5060 Ti truly shines is in a dual-GPU setup. Two cards ($860 total) deliver 32GB of VRAM, 8GB more than a single RTX 3090, for roughly the same price. This combo enables running larger 34B models, making it a smart, scalable path for enthusiasts chasing maximum model size on a budget.

The 24GB Sweet Spot

The 24GB VRAM tier is the sweet spot for a versatile single-GPU LLM setup, and the used NVIDIA RTX 3090 reigns supreme. At around $800 second-hand, it offers unmatched VRAM capacity for the price. That extra memory is what enables running 30B–32B parameter models, something simply impossible on 16GB GPUs. For enthusiasts chasing maximum model size without breaking the bank, the 3090’s price-to-VRAM ratio is unbeatable.
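To put the price-to-VRAM claim in numbers, the following sketch computes dollars per gigabyte for single cards with at least 24GB. The used RTX 3090 price is the roughly $800 figure quoted above; the other prices are the list prices from the tables. Treat the output as a rough comparison, since street prices move quickly.

```python
# (price in USD, VRAM in GB) for single cards with 24GB or more
cards = {
    "RTX 3090 (used)": (800.00, 24),
    "RTX 4090": (2574.00, 24),
    "RTX 5090": (2499.00, 32),
    "RTX 4090 48GB": (3100.00, 48),
    "RTX Pro 6000": (8000.00, 96),
}

# Dollars paid for each gigabyte of VRAM.
usd_per_gb = {name: price / vram for name, (price, vram) in cards.items()}
cheapest = min(usd_per_gb, key=usd_per_gb.get)

for name, cost in sorted(usd_per_gb.items(), key=lambda item: item[1]):
    print(f"{name:16s} ${cost:6.2f} per GB")
print("Lowest cost per GB:", cheapest)
```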

In terms of speed, the RTX 3090 holds its own. Although it was released in 2020, it was a flagship card, and its massive 384-bit memory bus and fast 19.5 Gbps memory deliver an impressive 936.2 GB/s of bandwidth, which is essential for fast token generation. This, combined with its generous 24 GB of VRAM, makes it an excellent choice for local LLM workloads. It hits 87.45 t/s on 8B models, nearly identical to the newer RTX 5070 Ti.
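The bandwidth figure follows directly from the bus width and the per-pin data rate. A minimal sketch of that arithmetic, using the nominal numbers quoted above (the function name is ours):

```python
def memory_bandwidth_gb_s(bus_width_bits, data_rate_gbps):
    """Peak bandwidth in GB/s: bytes moved per transfer times per-pin data rate."""
    return bus_width_bits / 8 * data_rate_gbps

# RTX 3090: 384-bit bus, 19.5 Gbps memory
rtx_3090 = memory_bandwidth_gb_s(384, 19.5)
print(f"RTX 3090: {rtx_3090:.1f} GB/s")
```

The nominal inputs give 936 GB/s, within rounding of the 936.2 GB/s quoted above.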

Figure: Comparison chart of NVIDIA RTX 3090, 3090 Ti, 4090, and 5090 GPUs showing LLM token generation performance and price-to-performance ratios. Highlights that the RTX 5090 leads in speed, while the RTX 3090 delivers the best cost-per-token performance for local LLM inference.

The trade-off is clear: slightly less speed in exchange for vastly greater model capacity. Even on heavy loads, like a 20B model with a 131k context, the 3090 still pushes over 62 t/s, proving both capable and efficient.

The real strength of the 3090’s 24GB lies in its flexibility. It can manage extended contexts, such as an 8B model at 131k or a 30B MoE at 65k, far beyond what 16GB cards can handle. For users who prioritize long-context performance and large-model capability, the used RTX 3090 remains the smartest and most balanced single-card solution available.
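One way to see how the 3090 copes with growing context is to express each result as a share of its 16K speed. The sketch below does this for gpt-oss 20B, reading the last value in each RTX 3090 row of the four token generation tables as the gpt-oss 20B result (an interpretation of the flattened rows, supported by the 62 t/s and 112 t/s figures quoted in the text).

```python
# RTX 3090 token generation speed (t/s) on gpt-oss 20B by context length
rtx_3090_gpt_oss_20b = {"16K": 128.51, "32K": 112.55, "65K": 89.64, "131K": 62.18}

baseline = rtx_3090_gpt_oss_20b["16K"]
# Fraction of the 16K speed that survives at each longer context.
retention = {ctx: tps / baseline for ctx, tps in rtx_3090_gpt_oss_20b.items()}

for ctx, share in retention.items():
    print(f"{ctx:>4}: {share:.0%} of 16K speed")
```

On these numbers the card keeps a little under half of its 16K generation speed at 131K context.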

The 32GB VRAM Tier

Once you move beyond 24GB of VRAM, the focus shifts from performance-per-dollar to raw capability. This is the realm of high-end consumer and select prosumer GPUs, including models we have yet to test, like the RTX Pro 4500 Blackwell and RTX 5000 Ada. We have currently tested only the RTX 5090, which serves as the new reference point for this tier.

Although we are commenting primarily on performance, it’s important to note that 32GB sits in a bit of an in-between segment. It won’t unlock entirely new models compared to 24GB GPUs; it will load the same models, but with longer contexts. The real question is whether you want to pay extra to run Qwen3 14B, Qwen3 30B A3B, or Qwen3 32B at extended contexts such as 131K, 147K, or 45K respectively, instead of being limited to 86K, 65K, or 16K. That additional VRAM enables deeper context windows and more complex inference runs rather than new model access.
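The reason extra VRAM translates into context rather than new models is that the key/value cache grows linearly with the number of tokens. The sketch below estimates that growth. The layer count, KV head count, and head dimension are assumed values for a dense 32B-class model with grouped-query attention, and the cache is assumed to be stored unquantized in 16-bit precision; none of these settings are stated in this article, so treat the result as an order-of-magnitude illustration.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    """Size of keys plus values for n_tokens, assuming a 16-bit cache by default."""
    # Factor 2 covers one key vector and one value vector per layer and KV head.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return bytes_per_token * n_tokens / 2**30

# Assumed model shape (hypothetical, for illustration only)
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128

for ctx in (16 * 1024, 45 * 1024):
    size = kv_cache_gib(LAYERS, KV_HEADS, HEAD_DIM, ctx)
    print(f"{ctx} tokens -> {size:.2f} GiB of KV cache")
```

Under these assumptions, moving from a 16K to a 45K context adds about 7 GiB of cache, which is in the same range as the 8GB gap between 24GB and 32GB cards.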

Figure: A horizontal bar chart comparing the performance of the NVIDIA RTX 3090 and RTX 5090 GPUs on various large language models, including Qwen3 30B, Qwen3 32B, and gpt-oss 20B. The comparison is shown across different context sizes: 16k, 32k, and 65k. The RTX 5090 consistently demonstrates higher performance than the RTX 3090 in all benchmarked configurations.

The RTX 5090 delivers top-tier performance for a consumer card. Built on the Blackwell-architecture GB202 GPU[1], it combines 32GB of ultra-fast GDDR7 memory with immense compute throughput. In our benchmarks, it reached 145 t/s on Qwen3 8B and maintained strong efficiency even on larger 30B and 32B models.

However, when directly comparing token generation speeds between the RTX 5090 and the RTX 3090 at the same context lengths, the difference is not dramatic for most workloads – the 3090 still performs very well. For instance, the RTX 3090 runs Qwen3 30B A3B MoE at 87 t/s on a 32K context, while the 5090 reaches 110 t/s. For gpt-oss 20B at 32K, the 3090 produces 112 t/s versus the 5090’s 215 t/s. The newer card is faster, but the 3090’s speeds remain entirely workable for most local inference tasks.

For users running models up to 34B, the RTX 5090 stands as the new performance leader and the most balanced choice before stepping into the inflated pricing of the prosumer segment.

The 48GB VRAM GPUs

The 48GB tier sits firmly in the prosumer category, where prices are higher but the jump in capability is substantial. This class unlocks the 70B models and includes the RTX 4090 48GB, RTX A6000, and RTX 6000 Ada desktop GPUs.

Among these, one product stands out for its uniqueness: the RTX 4090 48GB. This Chinese-modded variant repurposes standard RTX 4090 GPUs by combining functional chips with custom PCBs designed to accept double-sided memory configurations. The result is a 48GB version of the 4090 that offers exceptional value for its capacity, priced around $3,100.

In performance terms, it maintains excellent throughput, achieving 137 t/s on 30B models – just below the RTX 5090 but ahead of the RTX A6000 and RTX 6000 Ada in price-adjusted efficiency. For comparison, the RTX 6000 Ada posts 120 t/s, while the older A6000 lags further behind at 75 t/s. The 4090 48GB effectively bridges the gap between consumer and workstation hardware, delivering the cheapest path to 48GB without sacrificing much speed.

The 96GB VRAM Tier: Extreme Capacity for Large-Scale Models

At the top of the chart sits the RTX Pro 6000 96GB. This GPU represents the only viable option for users who need to load and run the largest LLMs (100B+ parameter range) without offloading to system memory or relying on multi-GPU setups.

Figure: Bar chart comparing NVIDIA RTX 5090 and RTX Pro 6000 WS GPU token generation speeds across large language models from 8B to 123B parameters at 16k context. The RTX 5090 achieves higher throughput on smaller models, while the RTX Pro 6000 WS shows better performance scaling with larger LLMs.

In our testing, the Pro 6000 reached 28 t/s on Llama 70B and 182 t/s on gpt-oss 120B MoE workloads. Its purpose is clear: not to dominate smaller benchmarks, but to make massive models feasible for local inference. For professional creators, AI researchers, and organizations building in-house model infrastructure, it remains the most capable single-GPU solution available.

GPU Prompt Processing Benchmarks

Prompt processing performance defines how quickly a GPU can ingest and prepare the context window before token generation begins. It’s the phase that often bottlenecks large models, especially with extended contexts or complex prompts. At Hardware Corner, we benchmarked the prompt processing (prefill) stage separately to highlight how GPU compute performance, architecture efficiency, and VRAM capacity influence throughput during model initialization and prefill.

Figure: Scatter plot showing GPU performance vs VRAM for 8B model at 16k context. Larger circles represent higher GPU prices, comparing models like RTX 5090, RTX 4090, RTX 4080, and RTX Pro 6000.

These results reveal where each VRAM tier truly stands. If you’ve ever wondered why your system slows down at the start of an inference run, or which GPU tier delivers the best context-loading efficiency, this section has your answer.
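To translate prompt processing throughput into something you can feel, divide the prompt length by the measured speed to get an approximate wait before the first generated token. The sketch below uses Qwen3 8B results from the 16K prompt processing table in this section. It assumes a prompt of exactly 16,384 tokens and a constant processing rate, and it ignores model load time, so the figures are estimates rather than measurements.

```python
# Qwen3 8B prompt processing speed (t/s) at 16K context, from the 16K table
pp_speed = {
    "RTX Pro 6000": 7587.74,
    "RTX 5090": 6956.11,
    "RTX 3090": 2572.49,
    "RTX 5060 Ti": 1447.92,
}

PROMPT_TOKENS = 16_384  # assumed prompt length

# Estimated seconds spent on prefill before generation can start.
prefill_seconds = {gpu: PROMPT_TOKENS / speed for gpu, speed in pp_speed.items()}

for gpu, seconds in prefill_seconds.items():
    print(f"{gpu:13s} ~{seconds:4.1f} s before the first token")
```

Under these assumptions the wait ranges from a little over 2 seconds on the fastest cards to roughly 11 seconds on the RTX 5060 Ti.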

Key Findings

  • RTX 5060 Ti (16GB): Best entry-level value; handles 13B–20B models efficiently for its price.
  • RTX 3090 (24GB): Top price-to-capacity pick; stable at long contexts up to 131k tokens.
  • RTX 5090 (32GB): Fastest consumer GPU; near-pro throughput at far lower cost.

GPU Benchmark Comparison Table: Prompt Processing Speed for 8B – 123B Models from 16K – 131K Context

These benchmark tables compare GPUs based on their prompt processing speed (tokens per second) when running large language models (LLMs) at context lengths from 16K to 131K. Each column represents a model – from Qwen3 8B to Mistral Large 123B at 4-bit quantization – and each value indicates the GPU’s processing throughput.


Table 16K Context

GPU Price Qwen3 8B Qwen3 14B Qwen3 30B MoE Qwen3 32B gpt-oss 20B Llama 3.3 70B gpt-oss 120B GLM 4.5 Air 106B Mistral Large 123B
RTX Pro 6000 $8,000.00 7587.74 5027.24 5084.45 2373.96 7478.71 1355.35 4060.65 2107.16 796.64
RTX 5090 $2,499.00 6956.11 4473.08 4669.17 2077.16 7168.10
RTX 4090 $2,574.00 6720.91 3927.70 4548.53 1684.85 6302.76
RTX 4090 48GB $3,100.00 5624.69 3471.08 4083.97 1671.58 6133.43 967.39
RTX 6000 Ada $5,000.00 4096.35 2367.17 3156.50 977.52 5350.48 526.03
RTX 5080 $999.00 4024.38 2542.01 4932.33
RTX 4080 SUPER $1,597.00 3858.11 2526.69 4708.66
RTX 4080 $1,349.00 3809.77 2295.10 4329.15
RTX 5070 Ti $749.00 3653.77 2303.90 4940.71
RTX 4070 Ti SUPER $1,148.00 3050.73 2003.44 4182.67
RTX 3090 Ti $1,499.00 2834.15 1914.75 2205.58 862.69 3440.06
RTX 3080 Ti $1,199.00 2658.54 1773.87
RTX 3090 $1,499.00 2572.49 1678.68 1958.99 767.82 3243.60
RTX 4070 SUPER $759.00 2525.76 1578.12
RTX A6000 $3,650.00 2427.98 1529.00 1901.13 704.82 2649.16 238.73
RTX 3080 10GB $699.00 2287.66
RTX 4070 Ti $900.00 2274.84 1951.36
RTX 4070 $579.00 2064.25 1355.83
RTX 5070 $579.00 1600.78 1315.22
RTX 4060 Ti $409.00 1480.81 917.56 2552.88
RTX 5060 Ti $430.00 1447.92 942.60 2753.30
RTX 3060 $250.00 1119.23 678.15

Table 32K Context

GPU Price Qwen3 8B Qwen3 14B Qwen3 30B MoE Qwen3 32B gpt-oss 20B Llama 3.3 70B gpt-oss 120B GLM 4.5 Air 106B Mistral Large 123B
RTX Pro 6000 $8,000.00 5300.66 3536.82 3863.04 1687.05 5039.46 1008.35 3368.32 1450.34 600.76
RTX 5090 $2,499.00 3687.76 2908.42 2877.53 1451.08 5183.08
RTX 4090 $2,574.00 4223.63 2511.27 2726.03 4641.50
RTX 4090 48GB $3,100.00 3568.67 2321.91 2777.81 1179.35 4183.29
RTX 6000 Ada $5,000.00 2218.45 1310.20 2010.62 595.46 3818.65
RTX 5080 $999.00 1943.27 1326.05 3653.30
RTX 4080 SUPER $1,597.00 2537.08 1769.29 3328.05
RTX 4080 $1,349.00 1968.35 1395.65 2831.05
RTX 5070 Ti $749.00 2268.97 1658.23 3903.47
RTX 4070 Ti SUPER $1,148.00 1616.55 1236.90 3020.03
RTX 3090 Ti $1,499.00 1867.96 1483.93 2521.32
RTX 3080 Ti $1,199.00 1761.17
RTX 3090 $1,499.00 1714.63 1175.71 1336.79 2547.19
RTX 4070 SUPER $759.00 1595.71
RTX A6000 $3,650.00 1593.26 1063.35 1275.59 502.73 1891.23
RTX 3080 10GB $699.00 1525.33
RTX 4070 Ti $900.00 1099.67
RTX 4070 $579.00 1116.68
RTX 5070 $579.00 898.83
RTX 4060 Ti $409.00 760.33 541.36 1964.90
RTX 5060 Ti $430.00 915.13 620.98 1737.71
RTX 3060 $250.00 764.71

Table 65K Context

GPU Price Qwen3 8B Qwen3 14B Qwen3 30B MoE Qwen3 32B gpt-oss 20B Llama 3.3 70B gpt-oss 120B GLM 4.5 Air 106B Mistral Large 123B
RTX Pro 6000 $8,000.00 1921.61 1378.08 2864.83 707.21 528.23 2360.34 698.67 343.07
RTX 5090 $2,499.00 2211.55 1707.24 1512.46 3019.74
RTX 4090 $2,574.00 1874.47 1545.47 1502.45 2092.21
RTX 4090 48GB $3,100.00 2097.36 1398.69 1622.76 737.88 2399.43
RTX 6000 Ada $5,000.00 1152.63 727.43 913.20 348.64 2270.30
RTX 5080 $999.00 1124.63 1834.71
RTX 4080 SUPER $1,597.00 1501.49 1961.53
RTX 4080 $1,349.00 937.90 1558.98
RTX 5070 Ti $749.00 1078.71 2644.93
RTX 4070 Ti SUPER $1,148.00 829.44 1716.20
RTX 3090 Ti $1,499.00 1111.11 828.45 1666.51
RTX 3080 Ti $1,199.00
RTX 3090 $1,499.00 1014.28 734.10 1720.56
RTX 4070 SUPER $759.00
RTX A6000 $3,650.00 950.96 668.87 764.16 318.95 1207.23
RTX 3080 10GB $699.00
RTX 4070 Ti $900.00
RTX 4070 $579.00
RTX 5070 $579.00
RTX 4060 Ti $409.00 392.12 1332.19
RTX 5060 Ti $430.00 529.78 1102.29
RTX 3060 $250.00

Table 131K Context

GPU Price Qwen3 8B Qwen3 14B Qwen3 30B MoE Qwen3 32B gpt-oss 20B Llama 3.3 70B gpt-oss 120B GLM 4.5 Air 106B Mistral Large 123B
RTX Pro 6000 $8,000.00 1009.72 717.01 1270.31 330.38 266.10 1289.76 320.19
RTX 5090 $2,499.00 948.40 908.39 716.76 1636.40
RTX 4090 $2,574.00 1450.99
RTX 4090 48GB $3,100.00 1105.24 759.86 876.10 1293.24
RTX 6000 Ada $5,000.00 541.27 358.58 448.63 1177.70
RTX 5080 $999.00 937.72
RTX 4080 SUPER $1,597.00 1145.01
RTX 4080 $1,349.00 898.45
RTX 5070 Ti $749.00 1363.70
RTX 4070 Ti SUPER $1,148.00 930.45
RTX 3090 Ti $1,499.00 612.37 936.39
RTX 3080 Ti $1,199.00
RTX 3090 $1,499.00 569.97 923.79
RTX 4070 SUPER $759.00
RTX A6000 $3,650.00 527.88 382.35 429.18 702.79
RTX 3080 10GB $699.00
RTX 4070 Ti $900.00
RTX 4070 $579.00
RTX 5070 $579.00
RTX 4060 Ti $409.00 731.23
RTX 5060 Ti $430.00 685.26
RTX 3060 $250.00

Prompt Processing Performance of 16GB GPUs

For local LLM enthusiasts, the 16GB VRAM tier marks a crucial step up, enabling 13B to 20B models that simply don’t fit on 8GB or 12GB cards. Here, “value” means reaching that VRAM capacity for the lowest cost possible. The RTX 5060 Ti 16GB stands out as the new entry-level benchmark, offering the memory needed to unlock larger models without a steep price tag.

While the RTX 4060 Ti 16GB once launched cheaper, it’s now harder to find and nearly the same price second-hand. Worse, its limited 288 GB/s memory bandwidth cripples its performance, making it marginally slower than even the RTX 3060 12GB for prompt processing at 32k context (760 vs 765 t/s on Qwen3 8B). Overall, it’s a poor value next to newer alternatives like the 5060 Ti.

Across the 16GB landscape, diminishing returns are clear. Cards like the RTX 4080 or 4070 Ti SUPER deliver over twice the throughput of the 5060 Ti but cost two to three times more. For value-oriented builders, that premium is wasted on marginally faster prompt ingestion. The savings from a 5060 Ti build are better spent on system RAM, storage, or even a second GPU.
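The diminishing-returns point can be made concrete by comparing how much faster each card is with how much more it costs. The sketch below uses list prices and Qwen3 8B prompt processing speeds from the 16K table above, with the RTX 5060 Ti as the baseline; the choice of model and context length is ours.

```python
# (name, price in USD, Qwen3 8B prompt processing t/s at 16K context)
baseline = ("RTX 5060 Ti", 430.00, 1447.92)
others = [
    ("RTX 4070 Ti SUPER", 1148.00, 3050.73),
    ("RTX 4080", 1349.00, 3809.77),
]

_, base_price, base_speed = baseline
# For each card: (how many times faster, how many times more expensive).
ratios = {
    name: (speed / base_speed, price / base_price)
    for name, price, speed in others
}

for name, (speed_ratio, price_ratio) in ratios.items():
    print(f"{name:18s} {speed_ratio:.2f}x the speed for {price_ratio:.2f}x the price")
```

In both cases the price multiple exceeds the speed multiple, which is the pattern described above.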

Every 16GB card (from the 5060 Ti to the 4080 SUPER) can load the same 14B model with a 32k context. The main difference lies in how fast they process that context. The RTX 5060 Ti 16GB strikes the best balance: affordable, capable, and practical. It’s the true workhorse of this VRAM tier, prioritizing accessible capacity over inflated performance premiums.

Prompt Processing Performance of 24GB GPUs

The 24GB tier is where single-GPU LLM setups become truly capable. It provides enough headroom for 30B–32B models, longer contexts, and smoother inference without resorting to multi-GPU complexity. At this level, the goal shifts from simply fitting a model to running it efficiently across large prompts and deep contexts.

The RTX 3090 defines this class. With 24GB of VRAM and a wide 384-bit bus, it balances capacity and throughput better than anything near its price. In tests with Qwen3 8B Q4_K at 16k context, it reached 2572 t/s in prompt processing. While newer cards like the RTX 4090 and 5090 are 2.6–2.7× faster, they cost over three times more, making them harder to justify for most local users.

The 3090’s real strength is stability at scale. It maintains solid speeds up to 65k context on 30B models and can stretch to 131k on 8B versions without significant drop-off. That bandwidth and VRAM combination keeps large prompt ingestion fast and prevents the stuttering common on lower-capacity cards.

For local inference, the RTX 3090 remains the practical benchmark. It’s not the newest or fastest, but it delivers where it matters: ample VRAM, steady prompt handling, and the best price-to-capacity value in the 24GB tier.

Prompt Processing Performance of 32GB+ GPUs

The 32GB+ VRAM tier represents the top end of local LLM performance, where raw capacity and bandwidth take priority over price. For now, this segment includes the RTX 5090 (32GB) and RTX PRO 6000 (96GB), as testing of upcoming 48GB GPUs is still in progress.

The RTX 5090 marks a major leap for consumer hardware. With 32GB of GDDR7 VRAM, it finally breaks past the long-standing 24GB limit. That extra memory allows dense 32B models to run with double the context (roughly 32k tokens) while maintaining excellent speed. In prompt processing, it sustains around 6,900–7,000 tokens per second on 8B models at 16k context, keeping pace with professional cards at a fraction of the cost. For most local users, the 5090 delivers unmatched performance-per-dollar and broad compatibility with large quantized models.

The RTX PRO 6000 (96GB) sits in a different class. It’s built for extreme workloads—models exceeding 120B parameters and contexts over 100k tokens. Its massive VRAM pool ensures stability and throughput even on the heaviest loads, reaching over 7,500 tokens per second on smaller benchmarks and maintaining smooth scaling at huge context lengths. However, at around $8,000, its price limits its appeal to developers or researchers who truly need the capacity.

In short, the RTX 5090 defines the peak of consumer LLM performance, while the PRO 6000 exists for those running models too large for anything else. Until the 48GB cards arrive, these two GPUs set the standard for what’s possible in the 32GB+ prompt processing class.

Conclusion

Speed is only part of the story. These benchmarks reveal clear performance tiers, but the right GPU depends entirely on how and what you plan to run. If your focus is fast, local chat or lightweight coding, a 16GB card like the RTX 5060 Ti offers remarkable efficiency and affordability. For deeper context handling or larger 30B–32B models, the 24GB tier (especially the RTX 3090) remains the most balanced and cost-effective single-GPU solution.

When performance and flexibility truly matter, the entry point for serious LLM work begins around 24GB of VRAM. At that level, you can comfortably run 30B-scale models with usable 45k-token contexts at solid generation speeds for coding and content workflows. Beyond that, the RTX 5090 and its 32GB of GDDR7 set a new standard for consumer performance, while professional-grade cards like the RTX PRO 6000 (96GB) unlock the 70B–120B range with near-perfect prompt throughput and multi-user headroom.

Ultimately, choosing a GPU for LLMs isn’t just about tokens per second; it’s about matching the hardware to your workload. Whether you’re running a compact local assistant or a production-grade MoE system, the data makes one thing clear: speed benchmarks guide you, but context defines your real performance.

Article Resources

To keep things accurate and useful, this article pulls from a mix of resources: technical white papers, benchmark results, open datasets, and hands-on testing by the community. We also point to solid research and trusted publications when it helps explain the trade-offs and techniques around running LLMs locally.
