The RTX 5090 has been out long enough that the llama.cpp community has converged on real numbers — not marketing slides, not synthetic benchmarks. Token throughput, prompt processing, context scaling, head-to-head against the 4090. This guide consolidates that data into the deep single-card bench reference no one else has assembled, and anchors it against the card most local-AI builders are actually running: the used RTX 3090.
The price reality first, because it’s where most readers should start. MSRP is $1,999. Street is currently $3,500-4,300 through major US retailers — Amazon at $4,329, mid-range AIB cards $2,900-3,400, premium and liquid-cooled past $5,000, per Pangoly’s RTX 5090 price tracker, VideoCardz coverage, and the pcprice.watch eBay tracker. The driver is the ongoing memory-supply shortage: GDDR7 is roughly 78% of the card’s manufacturing cost, and the same shortage that pushed NVIDIA’s DGX Spark from $3,999 to $4,699 in February is keeping consumer card prices elevated. RTX 5090 prices are unlikely to fall below ~$3,600 in 2026 per multiple supply analyses.
So the question this guide is actually answering is split: how fast IS the 5090 on what it can run, and is the 5090’s speed worth $3,500-4,300 in mid-2026 over either a used 4090 or a used 3090? The benches answer the first; the 3090 anchor section answers the second.
The 5090 hardware reference
| Spec | RTX 5090 | RTX 4090 | RTX 3090 |
|---|---|---|---|
| VRAM | 32 GB GDDR7 | 24 GB GDDR6X | 24 GB GDDR6X |
| Memory bandwidth | 1,792 GB/s | 1,008 GB/s | 936 GB/s |
| Memory bus | 512-bit | 384-bit | 384-bit |
| CUDA cores | 21,760 | 16,384 | 10,496 |
| TGP (rated) | ~575W | 450W | 350W |
| MXFP4 / NVFP4 native | Yes | No | No |
| MSRP (launch) | $1,999 | $1,599 | $1,499 |
| Current US street | ~$3,500-4,300 | $2,200-2,600 used | ~$900-1,300 used |
The 5090’s 1,792 GB/s memory bandwidth is 1.78× the 4090’s and 1.91× the 3090’s. Token generation in LLM inference is memory-bandwidth-bound, so that ratio sets the ceiling on speed gains. The 8 GB extra VRAM over the 4090/3090 is the other story — it’s the difference between fitting Qwen 3.6-27B dense at Q4 with comfortable context vs running it tight, and it’s what makes 100K+ context benches possible at all on a single card.
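The two ratios quoted above come straight from the spec table. A minimal Python sketch of that arithmetic, using only the bandwidth figures from the table (the dictionary and function names are illustrative, not from any library):

```python
# Memory bandwidth in GB/s, copied from the spec table above.
BANDWIDTH_GBPS = {"RTX 5090": 1792, "RTX 4090": 1008, "RTX 3090": 936}


def bandwidth_ratio(card: str, baseline: str) -> float:
    """Return how many times more memory bandwidth `card` has than `baseline`."""
    return BANDWIDTH_GBPS[card] / BANDWIDTH_GBPS[baseline]


for baseline in ("RTX 4090", "RTX 3090"):
    ratio = bandwidth_ratio("RTX 5090", baseline)
    print(f"RTX 5090 vs {baseline}: {ratio:.2f}x")
```

Treat these ratios as a ceiling on token-generation gains rather than a prediction; the measured speedups further down land below them.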
What 32GB does not unlock: a clean 70B at Q4_K_M (~40 GB), Llama 4 Scout at usable quant (~55 GB), or any of the Qwen 3.5/3.6 122B-A10B / 397B-A17B MoEs. The 32GB ceiling tops out around a 109B MoE (Llama 4 Scout, and only at aggressive quant). Fit is set by total parameters, since every expert has to sit in VRAM; the ~17B parameters active per token are why the model stays fast once loaded, not why it fits. For models genuinely too big for 32GB, the GB10 Boxes Compared page covers the 128GB-unified tier (DGX Spark, Strix Halo, Mac Studio).
5090 token generation, by model and context length
Community llama.cpp benchmarks, Q4_K_M unless noted. Numbers compiled from the r/LocalLLaMA PR #22673 thread, Hardware Corner GPU benchmarks, and llmcheck.net’s RTX 5090 measurements. Exact numbers vary by build, quant variant, and prompt structure — treat as community-converged, not single-source.
| Model | VRAM used | 4K ctx (tok/s) | 8K ctx (tok/s) | 32K ctx (tok/s) |
|---|---|---|---|---|
| Qwen 3.5 / 3.6 9B Q4_K_M | ~5 GB | ~186 | ~170 | ~112 |
| Qwen 3.5 / 3.6 27B Q4_K_M | ~17 GB | ~62 | ~56 | ~44 |
| Qwen 3.6-35B-A3B Q4_K_M (MoE) | ~22 GB | ~234 | ~170 | ~111 |
| Qwen 3.5-9B at higher quant (Q8) | ~10 GB | ~120 | ~108 | ~78 |
| Llama 4 Scout Q4 (109B/17B MoE) | ~28 GB | ~95 | ~84 | ~52 |
The MoE row is the standout: ~234 tok/s on a 35B-parameter model because only 3B parameters are active per token. MoE architectures are the 5090’s best friend: big-model quality at small-model speeds, and the 32GB ceiling fits both Qwen 3.6-35B-A3B and Llama 4 Scout comfortably (where the 24GB 4090/3090 fit only the smaller one).
The dense-27B row is the practical workload row. At Q4_K_M, Qwen 3.6-27B fits with ~15 GB of headroom for context and KV cache. ~62 tok/s at fresh context falls to ~44 tok/s at 32K context. That is the normal cost of a growing KV cache, which has to be read for every generated token, not a 5090-specific bottleneck. Every bandwidth-bound card shows the same pattern; the 5090 just lands the numbers higher.
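To reproduce a context sweep like the table above on your own card, something along these lines should work. This is a sketch, not the exact command behind the community numbers: the model filename is a placeholder, the prompt and generation lengths are arbitrary, and the `-d` (depth) flag is only present in recent llama.cpp builds, so check `llama-bench --help` on your build first.

```python
import shlex


def llama_bench_command(model_path: str, depths: list[int]) -> list[str]:
    """Build a llama-bench argument list that sweeps several context depths."""
    return [
        "llama-bench",
        "-m", model_path,                        # GGUF file to benchmark
        "-ngl", "99",                            # offload all layers to the GPU
        "-p", "512",                             # prompt-processing test length (arbitrary)
        "-n", "128",                             # tokens generated per test (arbitrary)
        "-d", ",".join(str(d) for d in depths),  # context already filled before each test
    ]


# Hypothetical filename; substitute the GGUF you actually have.
cmd = llama_bench_command("qwen-27b-q4_k_m.gguf", [4096, 8192, 32768])
print(shlex.join(cmd))
```

Pass the list to `subprocess.run(cmd)` to execute it. Keeping the build, quant, and depths identical across cards is what makes the resulting rows comparable.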
5090 prompt processing (prefill), by context length
Prefill is where Blackwell’s 21,760 CUDA cores flex hardest. Community llama-bench tables on Qwen 3.5 / 3.6 era models:
| Model | 4K ctx (tok/s) | 8K ctx (tok/s) | 32K ctx (tok/s) | 65K ctx (tok/s) |
|---|---|---|---|---|
| Qwen 3.5 / 3.6 9B Q4_K_M | ~10,400 | ~8,700 | ~3,700 | ~2,200 |
| Qwen 3.5 / 3.6 27B Q4_K_M | ~2,900 | ~2,500 | ~1,450 | ~900 |
| Qwen 3.6-35B-A3B Q4_K_M (MoE) | ~6,600 | ~5,800 | ~2,900 | ~1,500 |
Sources: same llama.cpp PR thread and llmcheck.net measurements as the generation tables.
10,000+ tok/s prompt processing on a 9B-class model. That means a 4,000-token system prompt processes in under half a second — RAG workflows with long retrieved context and agentic harnesses with large tool-call payloads see massive gains here vs the 4090. This is where the 5090’s compute advantage (vs the bandwidth-only advantage on generation) actually matters in practice.
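The "under half a second" figure is just prompt length divided by prefill rate. A small Python sketch using the 4K-context rates from the table above; it assumes the table's rate holds as an average over the whole prompt, which is a simplification.

```python
def prefill_seconds(prompt_tokens: int, prefill_tok_s: float) -> float:
    """Time to process a prompt, treating the rate as an average over the prompt."""
    return prompt_tokens / prefill_tok_s


# Approximate 4K-context prefill rates (tok/s) from the table above.
PREFILL_4K = {
    "9B Q4_K_M": 10_400,
    "27B Q4_K_M": 2_900,
    "35B-A3B Q4_K_M (MoE)": 6_600,
}

for model, rate in PREFILL_4K.items():
    print(f"{model}: 4,000-token prompt in ~{prefill_seconds(4_000, rate):.2f} s")
```

On the 9B row that works out to roughly 0.4 s; the dense 27B takes closer to 1.4 s for the same prompt.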
Extreme context: 131K+ on the 5090
The 5090 is the first consumer card with enough VRAM (32 GB) to push real benchmarks into 100K+ context territory on a 27B model. The 24 GB 4090 and 3090 OOM on the same workload. InsiderLLM’s firsthand 3090 troubleshooting work confirms the 3090 side: a single 3090 cannot load Qwen 3.6-27B at Q4_K_M with 128K context (flash-attn KV buffer allocation fails even with Q8 KV-cache quantization).
Community-reported 5090 numbers near and past the 100K mark, Q4_K_M:
| Model | VRAM used | Context | Prompt processing (tok/s) | Token gen (tok/s) |
|---|---|---|---|---|
| Qwen 3.6-9B | ~23 GB | 131K | ~950 | ~49 |
| Qwen 3.6-27B | ~31 GB | 131K | ~910 | ~37 |
| Qwen 3.6-35B-A3B (MoE) | ~31 GB | 147K | ~670 | ~52 |
Generation speed drops as context grows because the KV cache eats both VRAM and bandwidth. ~49 tok/s at 131K on a 9B-class model is still well above reading speed and meaningful for long-document analysis. The 27B at 131K landing at ~37 tok/s is the configuration that matters for serious RAG / agentic work — and the 5090 is the only single consumer card that can do it.
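Why the KV cache dominates at 131K is easier to see with the standard sizing formula for a grouped-query-attention transformer: two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below is illustrative only. The layer count, KV head count, and head dimension are placeholder values, not the published configuration of any Qwen model, and hybrid or sliding-window attention designs will come in smaller.

```python
GIB = 1024 ** 3


def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_value: int = 2,  # 2 = FP16 cache, 1 = 8-bit (Q8-style) cache
) -> int:
    """KV cache size for full attention: keys + values for every layer and token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


# Placeholder architecture (assumption, NOT a real model config).
fp16 = kv_cache_bytes(n_layers=48, n_kv_heads=4, head_dim=128, context_tokens=131_072)
q8 = kv_cache_bytes(48, 4, 128, 131_072, bytes_per_value=1)
print(f"FP16 KV cache at 131K: {fp16 / GIB:.1f} GiB")
print(f"8-bit KV cache at 131K: {q8 / GIB:.1f} GiB")
```

The cache grows linearly with context, so doubling the context doubles both the VRAM it occupies and the bytes read per generated token. Real 8-bit cache formats also store a small amount of scale data, so they land slightly above the halved figure.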
5090 vs 4090: real speedup, by workload
Same-build comparisons across model sizes and context lengths. Numbers compiled from the Hardware Corner cross-card benchmark database and the llama.cpp community threads:
| Model | Quant | Context | RTX 4090 | RTX 5090 | Speedup |
|---|---|---|---|---|---|
| Qwen 3.6-27B dense | Q4_K_M | 2K | ~47 tok/s | ~62 tok/s | +32% |
| Qwen 3.6-27B dense | Q4_K_M | 32K | ~30 tok/s | ~44 tok/s | +47% |
| Qwen 3.6-35B-A3B MoE | Q4_K_M | 2K | ~180 tok/s | ~234 tok/s | +30% |
| Qwen 3.6-35B-A3B MoE | Q4_K_M | 32K | ~75 tok/s | ~111 tok/s | +48% |
| Llama 4 Scout (109B/17B MoE) | Q4 | 8K | ~57 tok/s | ~84 tok/s | +47% |
| Mixtral 8x7B | Q5_K_M | 2K | ~47 tok/s | ~58 tok/s | +24% |
The pattern: 24-50% faster on token generation, with the gap widening at longer context. The bandwidth advantage (1.78×) gets partially offset at short contexts by compute being the bottleneck, but at 32K context the KV cache load tips the balance and the bandwidth ratio dominates. On prompt processing the speedup is much larger (2-4× in some configurations) because compute compounds with bandwidth on the prefill phase.
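The Speedup column is plain ratio arithmetic on the two tok/s columns. A short Python sketch that recomputes it for the Qwen rows, with the figures copied from the table above (the row labels and function name are illustrative):

```python
# (label, RTX 4090 tok/s, RTX 5090 tok/s) from the comparison table above.
ROWS = [
    ("Qwen 3.6-27B dense, 2K", 47, 62),
    ("Qwen 3.6-27B dense, 32K", 30, 44),
    ("Qwen 3.6-35B-A3B MoE, 2K", 180, 234),
    ("Qwen 3.6-35B-A3B MoE, 32K", 75, 111),
]


def speedup_percent(old_tok_s: float, new_tok_s: float) -> float:
    """Percentage gain of the new throughput over the old one."""
    return (new_tok_s / old_tok_s - 1) * 100


for label, rtx4090, rtx5090 in ROWS:
    print(f"{label}: +{speedup_percent(rtx4090, rtx5090):.0f}%")
```

Because the inputs are rounded community figures, a point or two of drift in the percentages is expected.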
Is the upgrade from a 4090 worth it? At $3,500-4,300 street for the 5090 vs $2,200-2,600 used for the 4090, the answer is “only if your workload heavily uses long context or you specifically need the 32 GB ceiling.” For pure chat-style 7B-27B inference at typical chat-length contexts, the +30% speedup vs a used 4090 doesn’t justify a ~$1,500 premium. For prompt-heavy RAG, agentic, or long-context coding workflows, the math shifts toward the 5090. The buying-decision logic across the full lineup lives in the GPU buying guide.
The used RTX 3090 anchor: 5090 vs the card most builders actually run
Most generic 5090 reviews compare the 5090 to the 4090 and stop. The honest value question for a local-AI builder is different: how does the 5090 compare to a used 3090 that costs ~$900-1,300?
This section is the editorial spine of the page. The 3090 numbers below are not invented or estimated — they come from InsiderLLM’s published firsthand work on Miu (a single RTX 3090, Linux, CUDA, sm_86), cited to the specific articles they appear in. The 5090 numbers above are community-sourced.
Qwen 3.6-27B Q4_K_M dense — the apples-to-apples row
| Setup | RTX 3090 (firsthand) | RTX 5090 (community) | Delta |
|---|---|---|---|
| Baseline llama.cpp, fresh context | 38 tok/s (fix-slow-qwen-3-6-27b-rtx-3090, June 10) | ~62 tok/s | +63% |
| Baseline at 32K context | 35 tok/s sustained (same) | ~44 tok/s | +26% |
| With MTP (mainline + llama-mtp fork, n=3) | 61.4 tok/s mean, 1.86× wall (wicked-fast-qwen-3-6-27b-mtp-rtx-3090, May 19) | not measured | n/a |
| With DFlash speculative decoding | 84.13 tok/s mean, 2.56× speedup (dflash-rtx-3090-bench-both-qwens, April 30) | not measured | n/a |
| Loading 128K context | OOMs (exceeds the 24 GB VRAM ceiling) (fix-slow-qwen-3-6-27b-rtx-3090) | ~37 tok/s (loads cleanly) | 5090 wins on capacity |
The honest read. On the workload where InsiderLLM has measured both sides apples-to-apples — Qwen 3.6-27B Q4_K_M dense — a stock 3090 runs at 38 tok/s fresh and 35 sustained at 32K. The 5090 runs the same workload at ~62 fresh and ~44 at 32K. That’s a real 26-63% speedup, larger at fresh context, narrower at 32K.
But the 3090’s speculative-decoding ceiling reshapes the comparison. With DFlash (custom llama.cpp fork), a 3090 hits 84 tok/s mean on the same model and quant, higher than the 5090’s stock ~62 tok/s baseline at the same fresh-context length. With MTP via the llama-mtp fork, the 3090 hits 61 tok/s mean / 1.86× wall-clock speedup. The 5090 would presumably gain from the same speculative-decoding path too, but InsiderLLM has not measured that and won’t claim a number. The point is: on the same dense 27B coding workload, a $1,000 used 3090 with the right software stack is faster than a $3,500-4,300 5090 running stock llama.cpp.
Where the 5090 genuinely wins on this workload: capacity headroom for 100K+ context. A 3090 cannot load Qwen 3.6-27B at 128K context — flash-attn KV allocation OOMs. The 5090’s 32 GB takes the same model to 131K cleanly. For long-context RAG and agentic harnesses, the 5090 is the only single consumer card that does the job.
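One way to put the speed and price figures on the same axis is tok/s per $1,000 spent. A rough Python sketch: the throughput numbers are the ones from the table above, the 3090 price is the ~$1,000 used figure quoted in this section, and $3,900 for the 5090 is simply the midpoint of the $3,500-4,300 street range, picked for illustration.

```python
def tok_s_per_kilodollar(tok_s: float, price_usd: float) -> float:
    """Fresh-context generation speed bought per $1,000 of card."""
    return tok_s / (price_usd / 1000)


# (tok/s on Qwen 3.6-27B Q4_K_M, assumed price in USD)
CONFIGS = {
    "RTX 3090, stock llama.cpp": (38, 1_000),
    "RTX 3090, DFlash": (84.13, 1_000),
    "RTX 5090, stock llama.cpp": (62, 3_900),  # midpoint price: an assumption
}

for name, (tok_s, price) in CONFIGS.items():
    print(f"{name}: {tok_s_per_kilodollar(tok_s, price):.1f} tok/s per $1,000")
```

This metric ignores capacity entirely, so it says nothing about the 100K+ context case where the 3090 cannot load the model at all.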
Other models — no firsthand 3090 measurement, no manufactured comparison
For the other 5090 community benches above (Qwen 9B, 35B-A3B MoE, Llama 4 Scout, Mixtral), InsiderLLM has not measured equivalent 3090 numbers firsthand. Manufacturing matching figures to complete clean comparison tables is the exact failure mode the InsiderLLM honesty discipline exists to prevent, so those rows are intentionally left out of the apples-to-apples table above. For community-reported 3090 numbers on those models, Hardware Corner’s RTX 3090 benchmark database is the cleanest external reference.
The point: the published 3090 firsthand work covers Qwen 3.5/3.6-27B dense deeply. That’s where the honest 5090-vs-3090 comparison lives. For the rest of the lineup, treat the 5090 community numbers above as the bench reference and the 3090 column as “see external trackers for that specific model.”
NVFP4 / MXFP4 native support: what it actually buys you
Blackwell’s tensor cores natively support 4-bit floating point (NVFP4 / MXFP4) — the 5090 is the first consumer card to do this in hardware. The practical impact for most readers is smaller than the marketing suggests.
For TensorRT-LLM and vLLM running production inference at batch, NVFP4 is meaningful. The hardware path delivers genuine throughput wins on long batches at the kernel level that you cannot replicate via software quantization on Ampere or Ada hardware.
For llama.cpp users (which is most local-AI builders), the impact is minimal. llama.cpp uses its own GGML quantization formats (Q4_K_M, Q5_K_S, etc.), which already achieve similar compression ratios with good quality in single-user, single-request workflows. The 5090 still runs Q4_K_M GGUFs faster than the 4090 because of bandwidth, not because of NVFP4. Until llama.cpp grows native NVFP4 support across its quantization stack, this advantage doesn’t surface for the typical local-LLM user.
The honest framing: NVFP4 matters for vLLM-on-batch and TensorRT-LLM workloads; treat it as roughly neutral for llama.cpp / Ollama use cases until kernel support catches up.
The bandwidth rule of thumb
A simple formula that roughly predicts the token-generation results in this guide:
tok/s ≈ (memory bandwidth in GB/s) × (efficiency factor) × ~5 / (model file size in GB)
The ~5 is an empirical multiplier, the same one the worked examples below apply; without it the raw quotient lands roughly 4-5× under the measured numbers.
Efficiency factor varies by stack:
- NVIDIA CUDA via llama.cpp / vLLM: ~0.12-0.14
- AMD Vulkan via llama.cpp: ~0.08-0.10
- Unified memory (DGX Spark, Strix Halo, Apple Silicon): ~0.06-0.10 depending on the memory controller
Worked examples:
- RTX 5090 on Qwen 3.5-9B Q4 (~5 GB): 1,792 × 0.13 × 5 / 5 ≈ 233 tok/s predicted. Measured: ~186. The rule overshoots by about 25% on a model this small.
- RTX 5090 on Qwen 3.6-27B Q4 (~17 GB): 1,792 × 0.13 × 5 / 17 ≈ 69 tok/s predicted. Measured: ~62. Within tolerance.
- DGX Spark on Qwen 3.6-27B Q4 (~17 GB): 273 × 0.08 × 5 / 17 ≈ 6.4 tok/s predicted. Real-world for that workload is also in the single digits.
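The worked examples above scale the raw bandwidth-over-size quotient by roughly 5. A minimal Python sketch of the rule that keeps that factor as an explicit `multiplier` argument, so it is visible rather than buried in the arithmetic. Function and parameter names are illustrative, and `implied_efficiency` is an added helper (not part of the rule as stated) that runs the calculation backwards under the assumption that each generated token reads the whole model file once.

```python
def predict_tok_s(
    bandwidth_gbps: float,
    efficiency: float,
    model_gb: float,
    multiplier: float = 5.0,  # empirical factor used in the worked examples
) -> float:
    """Rule-of-thumb decode speed from bandwidth, stack efficiency, and file size."""
    return bandwidth_gbps * efficiency * multiplier / model_gb


def implied_efficiency(measured_tok_s: float, bandwidth_gbps: float, model_gb: float) -> float:
    """Fraction of peak bandwidth a measured rate implies (dense model, one full read per token)."""
    return measured_tok_s * model_gb / bandwidth_gbps


print(f"5090, 27B Q4 predicted: {predict_tok_s(1792, 0.13, 17):.1f} tok/s")
print(f"DGX Spark, 27B Q4 predicted: {predict_tok_s(273, 0.08, 17):.1f} tok/s")
print(f"5090, 27B Q4 implied efficiency: {implied_efficiency(62, 1792, 17):.2f}")
```

Running the rule backwards on a measured number is a quick sanity check: if a claimed result implies more than 100% of peak bandwidth on a dense model, something about the benchmark (quant, speculative decoding, sparse activation) differs from what was stated.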
Bandwidth is destiny for local LLM token generation. Everything else (quant choice, framework optimization, KV cache layout) is a multiplier on top. The 5090’s 1,792 GB/s is the main driver of the generation results in this guide; the 3090’s 936 GB/s, about 52% of that, is why its fresh-context baseline is only ~61% of the 5090’s on the same workload.
What this guide routes elsewhere
Three things the local-AI cluster has fresher dedicated coverage on:
- 5090 vs DGX Spark / Strix Halo / Mac Studio (128GB-unified tier) → GB10 Boxes Compared. The 5090 has roughly 6.5× the GB10’s bandwidth (1,792 vs 273 GB/s) and 7× Strix Halo’s, but loses the capacity comparison hard — 32 GB doesn’t fit a clean 70B, while the 128GB-unified tier does. That page has the full field including current prices and the used-3090 reality check for the 128GB-unified decision.
- 5090 vs Mac Studio M3 Ultra / M5 Pro/Max → Apple M5 Pro/Max for Local AI. The M5 Max reaches 614 GB/s, and the M3 Ultra’s 819 GB/s + 128GB unified memory makes it the genuine 5090 competitor on large-model capacity workloads. CUDA vs MLX is the deal-breaker if your toolchain depends on either.
- 5090 vs the broader buying decision (3060, 3090, 4090, 5060 Ti, Intel B70) → GPU Buying Guide. That page has the budget tiers, used-market navigation (eBay sniping, fair prices), and the across-the-lineup decision tree. This guide is the deep 5090 bench reference; the buying decision lives there.
For per-model VRAM math (which model fits in what), the VRAM Requirements guide is the canonical reference.
The bottom line
The RTX 5090 is the fastest single consumer GPU for local AI in 2026. At MSRP ($1,999), it’s the clear high-end pick. At current street prices ($3,500-4,300 due to the DRAM shortage), the value equation depends entirely on what you actually run:
- For Qwen 3.6-27B-class dense workloads at typical chat-length contexts: a used 3090 ($900-1,300) with the right software stack (DFlash speculative decoding gets a stock 3090 to 84 tok/s mean, per InsiderLLM’s firsthand bench on Miu) outruns a stock 5090. Don’t overspend.
- For 100K+ context RAG, long-document analysis, or agentic harnesses with deep tool-call payloads: the 5090’s 32 GB makes it the only single consumer card that does the job. A 3090 OOMs at 128K on Qwen 3.6-27B; the 5090 handles 131K cleanly.
- For MoE workloads (Qwen 3.6-35B-A3B, Llama 4 Scout): the 5090’s bandwidth advantage compounds with MoE’s sparse activation pattern — ~234 tok/s on 35B-A3B at fresh context is real. Both cards run these models; the 5090 just lands the numbers higher.
- For models genuinely too big for 32 GB (clean 70B Q4, Qwen 3.5-122B-A10B, anything 100B+ dense): the 5090 cannot help. The 128GB-unified tier comparison in GB10 Boxes Compared is the relevant decision.
The honest answer for most readers hasn’t moved: figure out what you actually run, check the VRAM math against your specific model in the VRAM Requirements guide, and buy the cheapest card that fits that workload. The bench data above tells you roughly how fast each option will be. The InsiderLLM 3090 firsthand work tells you what the floor of “fast enough” actually looks like on the card most builders are running.
Related Guides
- GPU Buying Guide for Local AI — the buying decision across the lineup
- GB10 Boxes Compared: vs Strix Halo, vs Used 3090 — the 128GB-unified tier alternative
- Apple M5 Pro/Max for Local AI (2026) — the high-bandwidth Apple option
- VRAM Requirements for Local LLMs — per-model VRAM math
- Used RTX 3090 Buying Guide — the InsiderLLM-recommended value pick
- Firsthand RTX 3090 work (InsiderLLM, Miu, sourced inline above):
- Fix Slow Qwen 3.6-27B on RTX 3090 — baseline 38 tok/s, MTP path
- DFlash on RTX 3090: I Built It and Tested It (Both Qwens) — 2.56-2.59× DFlash speedup
- DFlash vs MTP on RTX 3090: Head-to-Head — 2.56× vs 1.50× same-card comparison
- Wicked Fast Qwen 3.6-27B with MTP on RTX 3090 — 1.86× wall-clock MTP speedup
- Llama 4 Guide: Scout and Maverick
- Qwen 3.6 Local Guide
- llama.cpp vs Ollama vs vLLM (2026)