URL: https://www.hardware-corner.net/llm-vram-usage-compared/

LLM VRAM Usage Compared: Benchmarking Popular 8B–123B Models Across 4K–256K Contexts

By Allan Witt | Updated: November 9, 2025

As someone who runs language models locally, I know that VRAM is the one resource we can never have enough of. Every parameter, every token of context, and the growing KV cache all chip away at that precious memory. To cut through the speculation and get hard data, I decided to benchmark some of today’s most popular models to see exactly how much VRAM they consume at different context lengths. This isn’t theoretical; this is a practical guide to help you figure out what your hardware can actually handle.

I ran these tests on an Ubuntu 24.04 machine with CUDA 12.8 and PyTorch 2.8.0. For the inference engine, I used llama.cpp (llama-bench with FlashAttention[1]), as it’s the backbone for many frontends like Ollama and LM Studio. All models were tested with a 4-bit Q4_K_XL quantization[2]. I tested nine popular models – from Qwen3 8B up to Mistral Large 123B – gradually increasing their context from 4K until I reached the model’s maximum limit or ran out of VRAM.
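To see why context length weighs so heavily on VRAM, here is a rough sketch of the usual KV cache size estimate for a transformer with grouped-query attention. The function name and parameters are illustrative and not part of llama.cpp. The example values (80 layers, 8 KV heads, head dimension 128) are the published Llama 3.3 70B configuration, and the estimate assumes an unquantized 16-bit cache. It ignores model weights, compute buffers, and architectures with sliding-window layers, so it will not match measured totals exactly.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 n_tokens: int, bytes_per_element: float = 2.0) -> float:
    """Approximate KV cache size in GiB for n_tokens of context."""
    # Keys and values are both cached, hence the leading factor of 2.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_element
    return bytes_per_token * n_tokens / 1024**3


if __name__ == "__main__":
    # Assumed example: Llama 3.3 70B shape with a 16-bit cache.
    for ctx in (4096, 16384, 131072):
        print(f"{ctx:>7} tokens: {kv_cache_gib(80, 8, 128, ctx):.2f} GiB")
```

With these values the estimate comes to 1.25 GiB at 4,096 tokens and 40 GiB at 131,072 tokens, all of it on top of the model weights.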

My Benchmark Data

The table below shows the raw results of my testing. I’ve laid out the VRAM required in gigabytes for each model at specific context lengths. I also highlighted rows corresponding to popular GPU VRAM capacities—12GB, 16GB, 24GB, and so on—so you can quickly map my findings to your current or future rig.

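VRAM Legend: 12 GB – RTX 3060; 16 GB – RTX 5060 Ti; 24 GB – RTX 3090; 32 GB – RTX 5090; 48 GB – A6000 Ada; 96 GB – RTX Pro 6000

VRAM (GB) | Qwen3 8B | Qwen3 14B | Qwen3 30B A3B | Qwen3 32B | gpt-oss 20B | Llama 3.3 70B | gpt-oss 120B | GLM 4.5 Air 106B | Mistral Large 123B
5  | 4K   | -    | -    | -    | -    | -    | -    | -    | -
6  | 8K   | -    | -    | -    | -    | -    | -    | -    | -
7  | 16K  | -    | -    | -    | -    | -    | -    | -    | -
9  | 32K  | 4K   | -    | -    | -    | -    | -    | -    | -
10 | -    | 8K   | -    | -    | -    | -    | -    | -    | -
11 | 45K  | 16K  | -    | -    | 2K   | -    | -    | -    | -
12 | -    | -    | -    | -    | 4K   | -    | -    | -    | -
13 | 57K  | -    | -    | -    | -    | -    | -    | -    | -
14 | 65K  | 32K  | -    | -    | 86K  | -    | -    | -    | -
15 | 70K  | 45K  | -    | -    | 131K | -    | -    | -    | -
17 | 86K  | -    | 4K   | -    | -    | -    | -    | -    | -
18 | -    | 57K  | 8K   | 4K   | -    | -    | -    | -    | -
19 | -    | 65K  | 16K  | -    | -    | -    | -    | -    | -
20 | -    | -    | 32K  | -    | -    | -    | -    | -    | -
21 | -    | -    | 45K  | -    | -    | -    | -    | -    | -
22 | -    | -    | -    | 8K   | -    | -    | -    | -    | -
23 | 131K | 86K  | 65K  | 16K  | -    | -    | -    | -    | -
24 | -    | -    | -    | -    | -    | -    | -    | -    | -
25 | -    | -    | 86K  | -    | -    | -    | -    | -    | -
28 | -    | -    | -    | 32K  | -    | -    | -    | -    | -
30 | -    | -    | 131K | -    | -    | -    | -    | -    | -
31 | -    | 131K | 147K | 45K  | -    | -    | -    | -    | -
34 | -    | -    | -    | 57K  | -    | -    | -    | -    | -
36 | -    | -    | -    | 65K  | -    | -    | -    | -    | -
39 | -    | -    | 226K | 78K  | -    | -    | -    | -    | -
40 | -    | -    | -    | -    | -    | -    | -    | -    | -
41 | -    | -    | -    | 86K  | -    | -    | -    | -    | -
42 | -    | -    | 262K | -    | -    | 4K   | -    | -    | -
47 | -    | -    | -    | 110K | -    | 16K  | -    | -    | -
52 | -    | -    | -    | 131K | -    | 40K  | -    | -    | -
60 | -    | -    | -    | -    | -    | 57K  | 1K   | -    | -
63 | -    | -    | -    | -    | -    | -    | 86K  | -    | -
65 | -    | -    | -    | -    | -    | -    | 131K | -    | -
68 | -    | -    | -    | -    | -    | -    | -    | 4K   | -
71 | -    | -    | -    | -    | -    | -    | -    | 20K  | 2K
79 | -    | -    | -    | -    | -    | -    | -    | 60K  | 20K
82 | -    | -    | -    | -    | -    | 131K | -    | -    | -
91 | -    | -    | -    | -    | -    | -    | -    | 131K | -
95 | -    | -    | -    | -    | -    | -    | -    | -    | 72K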
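One way to use the table is as a lookup: given a VRAM budget, find the longest measured context that fits. The sketch below does this for a single model, with the Qwen3 8B column of the table above transcribed by hand. The function and variable names are hypothetical, and the lookup leaves no headroom for the desktop, display output, or other processes, so treat its answers as upper bounds.

```python
from bisect import bisect_right

# (VRAM in GB, context in K tokens) pairs for the Qwen3 8B column,
# transcribed from the benchmark table.
QWEN3_8B = [(5, 4), (6, 8), (7, 16), (9, 32), (11, 45),
            (13, 57), (14, 65), (15, 70), (17, 86), (23, 131)]


def max_context_k(points, vram_gb):
    """Largest measured context (K tokens) that fits in vram_gb, or None."""
    needs = [gb for gb, _ in points]      # VRAM requirements, ascending
    i = bisect_right(needs, vram_gb)      # number of rows that fit the budget
    return points[i - 1][1] if i else None


if __name__ == "__main__":
    for budget in (4, 8, 12, 16, 24):
        print(f"{budget} GB -> {max_context_k(QWEN3_8B, budget)}")
```

The same function works for any other column once its rows are transcribed in ascending VRAM order.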

VRAM Tier Analysis: What I Found Your GPU Can Run

My benchmark data reveals clear tiers of capability based on GPU VRAM. Here’s my practical breakdown of what you can expect from common hardware configurations.

The 12GB Tier (e.g., RTX 3060)

From my tests, a 12GB card is a restrictive starting point. It handles 8B models with a generous 45k context and can push 14B models to 16k, but it hits a wall quickly after that. I could barely load the gpt-oss 20B model with a minimal 2k context. This tier works for smaller models or tasks that don’t need a deep context history. You can squeeze out a bit more by quantizing the KV cache, but you’re always operating on the edge.
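To put a rough number on the KV cache quantization mentioned above, the sketch below compares storage cost per cached element for three llama.cpp cache types. The block sizes are the standard ggml layouts (q8_0 stores 32 values in 34 bytes, q4_0 stores 32 values in 18 bytes). The helper name is made up for illustration, and the ratio covers only the cache itself, not weights or compute buffers. In llama.cpp the cache type is normally selected with the `--cache-type-k` and `--cache-type-v` options, though option names can change between versions.

```python
# Bytes per cached element for three llama.cpp KV cache types.
# q8_0: 32 int8 values + one 16-bit scale = 34 bytes per block of 32.
# q4_0: 32 4-bit values + one 16-bit scale = 18 bytes per block of 32.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def cache_size_ratio(cache_type: str) -> float:
    """KV cache size relative to an unquantized f16 cache."""
    return BYTES_PER_ELEMENT[cache_type] / BYTES_PER_ELEMENT["f16"]


if __name__ == "__main__":
    for name in BYTES_PER_ELEMENT:
        print(f"{name}: {cache_size_ratio(name):.1%} of the f16 cache size")
```

Under these assumptions a q8_0 cache takes a little over half the space of f16, which means roughly double the context for the same cache budget, at some cost in output quality.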

The 16GB Tier (e.g., RTX 5060 Ti 16GB)

When I tested with 16GB, the options opened up a bit. This amount of VRAM was enough to comfortably max out the context length on the gpt-oss 20B model, which topped out at 131k tokens and used about 15GB. However, you’re still locked out of the 30B and 32B class models. I see a 16GB card as a solid choice if you plan to stick with models in the 7B to 20B range but want the freedom to use very long contexts.

The 24GB Tier (e.g., RTX 3090, RTX 4090)

Based on my data, this is the definitive sweet spot for a versatile single-GPU local setup. With 24GB of VRAM, I was able to run a much wider range of models with significant context lengths. This tier handles five of the nine models I tested without issue. I got the Qwen3 30B MoE running with a 65k context and the dense 32B model with a respectable 16k context. For price-to-performance, my analysis shows a 24GB card provides the best balance, letting you run capable mid-size models without the complexity of a multi-GPU rig.

The 32GB Tier (e.g., RTX 5090)

Once you move past 24GB, you are primarily paying for the ability to push the context window further. On a 32GB card, for instance, I could max out the Qwen3 14B model’s context and hit an impressive 147K context with the 30B MoE variant. This is where things get interesting for developers doing long document analysis or complex coding tasks.

However, 32GB sits in a bit of an in-between segment. In the current model environment, it does not unlock access to larger models since the model set remains the same as with 24GB. The real advantage is in context length. With 32GB, you can take models like Qwen3 14B, Qwen3 30B A3B, and Qwen3 32B, which might run at 86K, 65K, and 16K contexts on 24GB, and extend them up to around 131K, 147K, and 45K contexts respectively. In practice, this added memory translates into the ability to handle longer sequences, larger documents, and more complex inference tasks without swapping or truncation.
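The 24GB versus 32GB comparison can also be read as a marginal cost: how many extra gigabytes each additional 1K tokens of context takes. The sketch below works this out from the table rows at 23 GB and 31 GB for the three models named above. The helper name is hypothetical, and a straight line between two measured points is only an approximation.

```python
def gb_per_1k_tokens(low, high):
    """Slope between two (VRAM GB, context K tokens) measurements."""
    (gb_lo, ctx_lo), (gb_hi, ctx_hi) = low, high
    return (gb_hi - gb_lo) / (ctx_hi - ctx_lo)


# (VRAM GB, context K) pairs taken from the 23 GB and 31 GB table rows.
MEASUREMENTS = {
    "Qwen3 14B": ((23, 86), (31, 131)),
    "Qwen3 30B A3B": ((23, 65), (31, 147)),
    "Qwen3 32B": ((23, 16), (31, 45)),
}

if __name__ == "__main__":
    for model, (low, high) in MEASUREMENTS.items():
        print(f"{model}: {gb_per_1k_tokens(low, high):.3f} GB per 1K tokens")
```

By this estimate the dense 32B model pays close to three times as much VRAM per token of context as the 30B MoE, which is why the same 8 GB buys it far less extra context.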

The 96GB Power User Tier (RTX Pro 6000)

If you have access to 96GB of VRAM, you are golden. This configuration handled almost every model I threw at it, running them at their maximum context length. The only exception was Mistral Large 123B, which still achieved a very usable 72k context. This is the domain of the serious enthusiast or professional who needs to run the largest open-source models available without compromise.

Conclusion

After running all these tests, my conclusion is clear. Think of a 24GB GPU not as the destination, but as the starting line. It’s the entry point for running a diverse set of modern architectures with meaningful context, but it’s just the beginning of what’s possible as you climb the VRAM ladder.

Every step up in VRAM unlocks larger models and longer contexts. My tests show that 48GB is where 70B models become practical: Llama 3.3 70B fit with a 16K context at 47GB. At 65GB, you can max out the context on the gpt-oss 120B. At 82GB, the full 131K context of Llama 3.3 70B is yours, and by 91GB, you’re running GLM 4.5 Air 106B at its full 131K context.

We need to keep this upward trend in mind. Models on the horizon like Qwen3 235B, GLM-4.6, and DeepSeek V3.1 are closing the gap with proprietary SOTA models like Claude and GPT-4. To run these, we move beyond single consumer GPUs and into the land of multi-GPU server setups and unified memory. This is a territory I plan to explore, so expect benchmarks on those setups in the near future.

To-Do / Upcoming Tests:

  • Benchmark mid-range models: Qwen3 24B and 27B
  • Test larger next-gen models:
    • Qwen3 Next 80B A3B
    • Qwen3 235B A22B
    • GLM-4.6 355B
    • DeepSeek V3.1 671B