LLM VRAM Usage Compared: Benchmarking Popular 8B–123B Models Across 4K–256K Contexts
By Allan Witt | Updated: November 9, 2025
As someone who runs language models locally, I know that VRAM is the one resource we can never have enough of. Every parameter, every token of context, and the growing KV cache all chip away at that precious memory. To cut through the speculation and get hard data, I decided to benchmark some of today’s most popular models to see exactly how much VRAM they consume at different context lengths. This isn’t theoretical; this is a practical guide to help you figure out what your hardware can actually handle.
I ran these tests on an Ubuntu 24.04 machine with CUDA 12.8 and PyTorch 2.8.0. For the inference engine, I used llama.cpp (llama-bench with FlashAttention[1]), as it’s the backbone for many frontends like Ollama and LM Studio. All models were tested with a 4-bit Q4_K_XL quantization[2]. I tested nine popular models, from Qwen3 8B up to Mistral Large 123B, gradually increasing the context from 4K until I reached the model’s maximum limit or ran out of VRAM.
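Why does context eat VRAM so quickly? In a standard transformer, the KV cache stores one key vector and one value vector per layer, per KV head, per token. Here is a minimal sketch of that arithmetic. The Llama 3.3 70B shape used below (80 layers, 8 KV heads, head dimension 128) and the 16-bit cache are assumptions taken from the published model config, not something I measured, and the function name is purely illustrative. The estimate ignores model weights, compute buffers, and architectures with sliding-window layers.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_element: float = 2.0) -> float:
    """Estimate KV cache size: one K and one V vector per layer, KV head and token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_element


GIB = 1024 ** 3

# Assumed Llama 3.3 70B shape (80 layers, 8 KV heads, head_dim 128), f16 cache.
for ctx in (4096, 16384, 131072):
    size = kv_cache_bytes(80, 8, 128, ctx)
    print(f"{ctx:>7} tokens -> {size / GIB:.2f} GiB of KV cache")
```

Under those assumptions the cache grows linearly with context, from about 1.25 GiB at 4K to about 40 GiB at 131K, which is why long-context runs of a 70B model need so much more memory than the weights alone.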
My Benchmark Data
The table below shows the raw results of my testing. I’ve laid out the VRAM required in gigabytes for each model at specific context lengths. I also highlighted rows corresponding to popular GPU VRAM capacities—12GB, 16GB, 24GB, and so on—so you can quickly map my findings to your current or future rig.
| VRAM (GB) | Qwen3 8B | Qwen3 14B | Qwen3 30B A3B | Qwen3 32B | gpt-oss 20B | Llama 3.3 70B | gpt-oss 120B | GLM 4.5 Air 106B | Mistral Large 123B |
|---|---|---|---|---|---|---|---|---|---|
| 5 | 4K | | | | | | | | |
| 6 | 8K | | | | | | | | |
| 7 | 16K | | | | | | | | |
| 9 | 32K | 4K | | | | | | | |
| 10 | | 8K | | | | | | | |
| 11 | 45K | 16K | | | 2K | | | | |
| 12 | | | | | 4K | | | | |
| 13 | 57K | | | | | | | | |
| 14 | 65K | 32K | | | 86K | | | | |
| 15 | 70K | 45K | | | 131K | | | | |
| 17 | 86K | | 4K | | | | | | |
| 18 | | 57K | 8K | 4K | | | | | |
| 19 | | 65K | 16K | | | | | | |
| 20 | | | 32K | | | | | | |
| 21 | | | 45K | | | | | | |
| 22 | | | | 8K | | | | | |
| 23 | 131K | 86K | 65K | 16K | | | | | |
| 24 | | | | | | | | | |
| 25 | | | 86K | | | | | | |
| 28 | | | | 32K | | | | | |
| 30 | | | 131K | | | | | | |
| 31 | | 131K | 147K | 45K | | | | | |
| 34 | | | | 57K | | | | | |
| 36 | | | | 65K | | | | | |
| 39 | | | 226K | 78K | | | | | |
| 40 | | | | | | | | | |
| 41 | | | | 86K | | | | | |
| 42 | | | 262K | | | 4K | | | |
| 47 | | | | 110K | | 16K | | | |
| 52 | | | | 131K | | 40K | | | |
| 60 | | | | | | 57K | 1K | | |
| 63 | | | | | | | 86K | | |
| 65 | | | | | | | 131K | | |
| 68 | | | | | | | | 4K | |
| 71 | | | | | | | | 20K | 2K |
| 79 | | | | | | | | 60K | 20K |
| 82 | | | | | | 131K | | | |
| 91 | | | | | | | | 131K | |
| 95 | | | | | | | | | 72K |
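If you want to query results like these programmatically, a small lookup is enough. The sketch below hard-codes a handful of (VRAM, context) pairs copied from the table above for four of the models; the dictionary layout and the helper name are my own illustrative choices, and the function only reports measured points rather than interpolating between them.

```python
from typing import Optional

# (vram_gb, context_k) pairs copied from the table above; only a few
# models and rows are included to keep the sketch short.
MEASURED = {
    "Qwen3 14B": [(11, 16), (23, 86), (31, 131)],
    "Qwen3 30B A3B": [(23, 65), (31, 147)],
    "Qwen3 32B": [(23, 16), (31, 45)],
    "gpt-oss 20B": [(11, 2), (15, 131)],
}


def largest_context_k(model: str, vram_gb: float) -> Optional[int]:
    """Largest measured context (in K tokens) that fit within vram_gb, or None."""
    fitting = [ctx for need_gb, ctx in MEASURED[model] if need_gb <= vram_gb]
    return max(fitting) if fitting else None


# What fits on a 24GB card, according to the sampled rows?
for name in MEASURED:
    print(name, largest_context_k(name, 24))
```

A result of `None` means none of the sampled rows for that model fit in the budget, not that the model is guaranteed to fail to load.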
VRAM Tier Analysis: What I Found Your GPU Can Run
My benchmark data reveals clear tiers of capability based on GPU VRAM. Here’s my practical breakdown of what you can expect from common hardware configurations.
The 12GB Tier (e.g., RTX 3060)
From my tests, a 12GB card is a restrictive starting point. It handles 8B models with a generous 45k context and can push 14B models to 16k, but it hits a wall quickly after that. I could barely load the gpt-oss 20B model with a minimal 2k context. This tier works for smaller models or tasks that don’t need a deep context history. You can squeeze out a bit more by quantizing the KV cache, but you’re always operating on the edge.
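To put a rough number on KV cache quantization: llama.cpp can keep the cache in quantized types such as q8_0 or q4_0 instead of f16. The sketch below compares storage cost per cached element, assuming the usual GGML block layouts (32 values plus a 2-byte scale per block). The helper name is hypothetical, and these are back-of-the-envelope figures rather than measurements from the table.

```python
# Bytes per cached element for common llama.cpp KV cache types.
# q8_0: 32 int8 values + 2-byte scale = 34 bytes per 32 elements.
# q4_0: 32 4-bit values (16 bytes) + 2-byte scale = 18 bytes per 32 elements.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,
    "q4_0": 18 / 32,
}


def context_multiplier(cache_type: str) -> float:
    """How many times more tokens fit in the same KV cache budget versus f16."""
    return BYTES_PER_ELEMENT["f16"] / BYTES_PER_ELEMENT[cache_type]


for cache_type in BYTES_PER_ELEMENT:
    print(f"{cache_type}: {context_multiplier(cache_type):.2f}x the f16 context")
```

The multiplier applies only to the slice of VRAM used by the cache, not to the model weights, so the real-world gain on a 12GB card is smaller than these ratios suggest, and more aggressive cache quantization can cost some output quality.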
The 16GB Tier (e.g., RTX 5060 Ti 16GB)
When I tested with 16GB, the options opened up a bit. This amount of VRAM was enough to comfortably max out the context length on the gpt-oss 20B model, which topped out at 131k tokens and used about 15GB. However, you’re still locked out of the 30B and 32B class models. I see a 16GB card as a solid choice if you plan to stick with models in the 7B to 20B range but want the freedom to use very long contexts.
The 24GB Tier (e.g., RTX 3090, RTX 4090)
Based on my data, this is the definitive sweet spot for a versatile single-GPU local setup. With 24GB of VRAM, I was able to run a much wider range of models with significant context lengths. This tier handles five of the nine models I tested without issue. I got the Qwen3 30B MoE running with a 65k context and the dense 32B model with a respectable 16k context. For price-to-performance, my analysis shows a 24GB card provides the best balance, letting you run capable mid-size models without the complexity of a multi-GPU rig.
The 32GB Tier (e.g., RTX 5090)
Once you move past 24GB, you are primarily paying for the ability to push the context window further. On a 32GB card, for instance, I could max out the Qwen3 14B model’s context and hit an impressive 147K context with the 30B MoE variant. This is where things get interesting for developers doing long document analysis or complex coding tasks.
However, 32GB sits in a bit of an in-between segment. With the models I tested, it does not unlock anything larger: the set of models that fit is the same as on 24GB. The real advantage is in context length. Qwen3 14B, Qwen3 30B A3B, and Qwen3 32B top out at 86K, 65K, and 16K contexts on 24GB; with 32GB they extend to around 131K, 147K, and 45K respectively. In practice, this added memory translates into the ability to handle longer sequences, larger documents, and more complex inference tasks without swapping or truncation.
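One way to reason about what extra VRAM buys you is to turn two measurements into a per-token cost. The sketch below does that for Qwen3 32B using the 16K at 23GB and 45K at 31GB rows from the table. It assumes memory grows linearly with context, which is only an approximation, and the function name and the 40GB budget are hypothetical.

```python
from typing import Tuple


def gb_per_1k_tokens(low: Tuple[float, float], high: Tuple[float, float]) -> float:
    """Slope between two (context_k, vram_gb) measurements, in GB per 1K tokens."""
    (ctx_lo, gb_lo), (ctx_hi, gb_hi) = low, high
    return (gb_hi - gb_lo) / (ctx_hi - ctx_lo)


# Qwen3 32B rows from the table: 16K at 23 GB and 45K at 31 GB.
slope = gb_per_1k_tokens((16, 23), (45, 31))
print(f"~{slope:.2f} GB per extra 1K tokens")

# Rough projection for a hypothetical 40 GB budget, assuming linear growth.
extra_k = (40 - 31) / slope
print(f"~{45 + extra_k:.0f}K tokens projected at 40 GB")
```

Treat the projection as a sanity check rather than a guarantee; allocator overhead and compute buffers do not scale perfectly linearly.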
The 96GB Power User Tier (RTX Pro 6000)
If you have access to 96GB of VRAM, you are golden. This configuration handled almost every model I threw at it, running them at their maximum context length. The only exception was Mistral Large 123B, which still achieved a very usable 72k context. This is the domain of the serious enthusiast or professional who needs to run the largest open-source models available without compromise.
Conclusion
After running all these tests, my conclusion is clear. Think of a 24GB GPU not as the destination, but as the starting line. It’s the entry point for running a diverse set of modern architectures with meaningful context, but it’s just the beginning of what’s possible as you climb the VRAM ladder.
Every step up in VRAM unlocks larger models or longer context. In my tests, 48GB was enough to load Llama 3.3 70B with a 16K context (47GB) and to push Qwen3 32B to 110K. At 65GB, you can max out the context on the gpt-oss 120B. At 82GB, the full 131K context of Llama 3.3 70B is yours, and by 91GB, you can run GLM 4.5 Air 106B at its full 131K context.
We need to keep this upward trend in mind. Models on the horizon like Qwen3 235B, GLM-4.6, and DeepSeek V3.1 are closing the gap with proprietary SOTA models like Claude and GPT-4. To run these, we move beyond single consumer GPUs and into the land of multi-GPU server setups and unified memory. This is a territory I plan to explore, so expect benchmarks on those setups in the near future.
To-Do / Upcoming Tests:
- Benchmark mid-range models: Qwen3 24B and 27B
- Test larger next-gen models:
  - Qwen3 Next 80B A3B
  - Qwen3 235B A22B
  - GLM-4.6 355B
  - DeepSeek V3.1 671B
To keep things accurate and useful, this article pulls from a mix of resources: technical white papers, benchmark results, open datasets, and hands-on testing by the community. We also point to solid research and trusted publications when it helps explain the trade-offs and techniques around running LLMs locally.
- [1] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- [2] Unsloth AI: Unsloth - Dynamic 4-bit Quantization
