
As someone who runs language models locally, I know that VRAM is the one resource we can never have enough of. Every parameter, every token of context, and the growing KV cache all chip away at that precious memory. To cut through the speculation and get hard data, I decided to benchmark some of today’s most popular models to see exactly how much VRAM they consume at different context lengths. This isn’t theoretical; this is a practical guide to help you figure out what your hardware can actually handle.

I ran these tests on an Ubuntu 24.04 machine with CUDA 12.8 and PyTorch 2.8.0. For the inference engine, I used llama.cpp (llama-bench with FlashAttention^[1]), as it’s the backbone for many frontends like Ollama and LM Studio. All models were tested with a 4-bit Q4_K_XL quantization^[2]. I tested nine popular models – from Qwen3 8B up to Mistral Large 123B – gradually increasing their context from 4K until I reached the model’s maximum limit or ran out of VRAM.
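As a rough sanity check on why VRAM climbs with context, the sketch below estimates KV cache size from a model's attention shape. The layer count, KV head count, and head dimension are assumptions chosen to resemble an 8B-class model with grouped-query attention; they are not taken from the benchmark, so check the model's own config before relying on them. The estimate also assumes an unquantized FP16 cache and ignores weights, compute buffers, and runtime overhead.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2.0):
    """Estimate KV cache size in bytes.

    Each token stores one K and one V vector per layer, each of size
    n_kv_heads * head_dim, hence the leading factor of 2.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Hypothetical 8B-class shape (assumed, not measured): 36 layers,
# 8 KV heads, head dimension 128, FP16 cache (2 bytes per element).
per_token = kv_cache_bytes(36, 8, 128, 1)
at_32k_gib = kv_cache_bytes(36, 8, 128, 32 * 1024) / 2**30

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"KV cache at 32K context: {at_32k_gib:.1f} GiB")
```

Under these assumptions the cache grows linearly with context, which is why each doubling of context in the table costs progressively more VRAM in absolute terms.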

My Benchmark Data

The table below shows the raw results of my testing. I’ve laid out the VRAM required in gigabytes for each model at specific context lengths. I also highlighted rows corresponding to popular GPU VRAM capacities—12GB, 16GB, 24GB, and so on—so you can quickly map my findings to your current or future rig.

VRAM Legend:
- 12 GB – RTX 3060
- 16 GB – RTX 5060 Ti
- 24 GB – RTX 3090
- 32 GB – RTX 5090
- 48 GB – A6000 Ada
- 96 GB – RTX Pro 6000

VRAM (GB)	Qwen3 8B	Qwen3 14B	Qwen3 30B A3B	Qwen3 32B
5	4K
6	8K
7	16K
9	32K	4K
10	8K
11	45K	16K	2K
12	4K
13	57K
14	65K	32K	86K
15	70K	45K	131K
17	86K	4K
18	57K	8K	4K
19	65K	16K
20	32K
21	45K
22	8K
23	131K	86K	65K	16K
24
25	86K
28	32K
30	131K
31	131K	147K	45K
34	57K
36	65K
39	226K	78K
40
41	86K
42	262K	4K
47	110K	16K
52	131K	40K
60	57K	1K
63	86K
65	131K
68	4K
71	20K	2K
79	60K	20K
82	131K
91	131K
95	72K

VRAM Tier Analysis: What I Found Your GPU Can Run

My benchmark data reveals clear tiers of capability based on GPU VRAM. Here’s my practical breakdown of what you can expect from common hardware configurations.

The 12GB Tier (e.g., RTX 3060)

From my tests, a 12GB card is a restrictive starting point. It handles 8B models with a generous 45K context and can push 14B models to 16K, but it hits a wall quickly after that. I could barely load the gpt-oss 20B model with a minimal 2K context. This tier works for smaller models or tasks that don’t need a deep context history. You can squeeze out a bit more by quantizing the KV cache, but you’re always operating on the edge.
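To get a feel for how much KV cache quantization can buy on a tight card, here is a minimal sketch comparing cache sizes across three llama.cpp cache types. The bytes-per-element figures follow the ggml block layouts (q8_0 stores 32 values in 34 bytes, q4_0 stores 32 values in 18 bytes). The model shape is a hypothetical 14B-class configuration, assumed for illustration rather than measured, and the numbers cover only the cache, not weights or other buffers.

```python
# Effective bytes per cached element for each cache type.
# f16: 2 bytes; q8_0: 32 int8 values + 2-byte scale; q4_0: 16 bytes + 2-byte scale.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_gib(n_layers, n_kv_heads, head_dim, n_ctx, cache_type="f16"):
    """Estimate KV cache size in GiB for a given cache element type."""
    elems = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # K and V
    return elems * BYTES_PER_ELEM[cache_type] / 2**30


# Hypothetical 14B-class shape (assumed): 40 layers, 8 KV heads, head dim 128.
for cache_type in ("f16", "q8_0", "q4_0"):
    size = kv_cache_gib(40, 8, 128, 16 * 1024, cache_type)
    print(f"{cache_type:>5}: {size:.2f} GiB at 16K context")
```

In llama.cpp the cache types are selected with `--cache-type-k` and `--cache-type-v`. Whether the quality trade-off is acceptable depends on the model and task, so treat the savings here as an upper bound on what you gain, not a recommendation.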

The 16GB Tier (e.g., RTX 5060 Ti 16GB)

When I tested with 16GB, the options opened up a bit. This amount of VRAM was enough to comfortably max out the context length on the gpt-oss 20B model, which topped out at 131K tokens and used about 15GB. However, you’re still locked out of the 30B and 32B class models. I see a 16GB card as a solid choice if you plan to stick with models in the 7B to 20B range but want the freedom to use very long contexts.

The 24GB Tier (e.g., RTX 3090, RTX 4090)

Based on my data, this is the definitive sweet spot for a versatile single-GPU local setup. With 24GB of VRAM, I was able to run a much wider range of models with significant context lengths. This tier handles five of the nine models I tested without issue. I got the Qwen3 30B MoE running with a 65K context and the dense 32B model with a respectable 16K context. For price-to-performance, my analysis shows a 24GB card provides the best balance, letting you run capable mid-size models without the complexity of a multi-GPU rig.

The 32GB Tier (e.g., RTX 5090)

Once you move past 24GB, you are primarily paying for the ability to push the context window further. On a 32GB card, for instance, I could max out the Qwen3 14B model’s context and hit an impressive 147K context with the 30B MoE variant. This is where things get interesting for developers doing long document analysis or complex coding tasks.

However, 32GB sits in a bit of an in-between segment. With the models I tested, it does not unlock any larger models: the set that fits is the same as with 24GB. The real advantage is in context length. With 32GB, you can take Qwen3 14B, Qwen3 30B A3B, and Qwen3 32B, which top out at roughly 86K, 65K, and 16K contexts on 24GB, and extend them to around 131K, 147K, and 45K respectively. In practice, this added memory translates into the ability to handle longer sequences, larger documents, and more complex inference tasks without swapping or truncation.
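If you want a ballpark figure for a context length that falls between two measured rows, linear interpolation is a reasonable first guess, since KV cache growth is roughly linear in context. The sketch below uses the two Qwen3 32B points that the tier descriptions line up with in the table (16K at 23 GB and 45K at 31 GB). The helper name `estimate_vram_gb` is hypothetical, and the result is an estimate under that linearity assumption, not a measurement.

```python
def estimate_vram_gb(points, n_ctx_k):
    """Linearly interpolate VRAM (GB) between measured (context_k, vram_gb) points."""
    points = sorted(points)
    if not points[0][0] <= n_ctx_k <= points[-1][0]:
        raise ValueError("context outside the measured range")
    for (c0, v0), (c1, v1) in zip(points, points[1:]):
        if c0 <= n_ctx_k <= c1:
            return v0 + (v1 - v0) * (n_ctx_k - c0) / (c1 - c0)
    return points[0][1]  # only reached when a single point was given


# Qwen3 32B points as read from the 23 GB and 31 GB rows of the table.
qwen3_32b = [(16, 23), (45, 31)]
print(f"Estimated VRAM at 32K: {estimate_vram_gb(qwen3_32b, 32):.1f} GB")
```

The function refuses to extrapolate past the measured range on purpose: beyond the last data point you may hit the model's context limit or run out of memory, and a straight line will not tell you which.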

The 96GB Power User Tier (RTX Pro 6000)

If you have access to 96GB of VRAM, you are golden. This configuration handled almost every model I threw at it, running them at their maximum context length. The only exception was Mistral Large 123B, which still achieved a very usable 72K context. This is the domain of the serious enthusiast or professional who needs to run the largest open-source models available without compromise.

Conclusion

After running all these tests, my conclusion is clear. Think of a 24GB GPU not as the destination, but as the starting line. It’s the entry point for running a diverse set of modern architectures with meaningful context, but it’s just the beginning of what’s possible as you climb the VRAM ladder.

Every step up in VRAM unlocks a significant leap in model quality and capability. My tests show that with 48GB, you’re not just running 70B models—you’re running them with massive context windows. At 65GB, you can max out the context on the gpt-oss 120B. At 82GB, the full 131K context of Llama 3.3 70B is yours, and by 91GB, you’re running powerful models like the GLM 4.5 Air 106B at their peak performance.

We need to keep this upward trend in mind. Models on the horizon like Qwen3 235B, GLM-4.6, and DeepSeek V3.1 are closing the gap with proprietary SOTA models like Claude and GPT-4. To run these, we move beyond single consumer GPUs and into the land of multi-GPU server setups and unified memory. This is a territory I plan to explore, so expect benchmarks on those setups in the near future.

To-Do / Upcoming Tests:

- Benchmark mid-range models: Qwen3 24B and 27B
- Test larger next-gen models:
  - Qwen3 Next 80B A3B
  - Qwen3 235B A22B
  - GLM-4.6 355B
  - DeepSeek V3.1 671B

URL: https://www.hardware-corner.net/llm-vram-usage-compared/

LLM VRAM Usage Compared: Benchmarking Popular 8B–123B Models Across 4K–256K Contexts
