
URL: https://willitrunai.com/blog/what-can-you-run-on-16gb-24gb-32gb-vram

What Can You Run on 16GB, 24GB, 32GB VRAM? — Local LLM Guide (April 2026) | Will It Run AI Blog


Short answer: 24 GB VRAM is the 2026 sweet spot. 16 GB still runs a strong lineup, and 32 GB adds headroom for Q5/Q6 quants and 1M-context Qwen 3.6 workloads.

This guide lists the local LLMs that fit on 16 GB, 24 GB, and 32 GB of VRAM as of April 2026: per-tier top picks, tokens per second on common GPUs, coding vs chat picks, and when upgrading is actually worth it.

Need a fit check for a specific GPU? Use the VRAM Calculator. For a ranked list against a specific Apple Silicon Mac, see Best Local LLMs for MacBook Air M4 24GB or MacBook Pro M4 Pro 24GB.

16 GB VRAM — Mid-range (RTX 4060 Ti 16GB, RTX 4070 Ti Super 16GB, RTX 5070 Ti, RTX 4080 16GB, RX 7900 GRE)

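The sweet spot for dense 9-27B models. Qwen3.6-27B (released April 22, 2026) is the best model at this tier, but at Q4_K_M it weighs 16.8 GB, marginally over a 16 GB card's capacity, so it is a tight fit that likely needs a few layers offloaded to system RAM or a slightly smaller quant.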

| Use case | Model | VRAM Q4 | Best quant | tok/s (RTX 4070 Ti) |
|---|---|---|---|---|
| Best overall (NEW) | Qwen3.6-27B | 16.8 GB | Q4_K_M | ~38 |
| Best coding / agentic | Qwen3.6-27B | 16.8 GB | Q4_K_M | ~38 |
| General chat | Qwen 3.5 9B | ~5.1 GB | Q8_0 (~10 GB) | ~70 |
| Coding (small / fast) | Qwen 3 Coder 14B | ~8.3 GB | Q6_K (~12 GB) | ~50 |
| Previous-gen dense | Qwen 3.5 27B | ~16 GB | Q4_K_M tight | ~38 |
| Instruction-follow | Llama 3.1 8B | ~4.6 GB | Q8_0 (~8 GB) | ~80 |
| Reasoning / Math | DeepSeek R1 Distill 14B | ~8 GB | Q5_K_M | ~60 |

Does NOT fit at useful quant: Qwen3.6-35B-A3B MoE (~21 GB), Qwen 3.5 35B-A3B (~21 GB), DeepSeek R1 32B full, any Llama 4 variant.

Upgrade trigger: If you want MoE efficiency or long-context agentic workloads, jump to 24 GB.
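
The fit checks in this section boil down to comparing a model's quantized size against the card's VRAM. Here is a minimal Python sketch using the Q4 sizes quoted in this section. The `overhead_gb` allowance for KV cache and runtime buffers is an assumed placeholder rather than a measured value, and `fits` is a hypothetical helper name.

```python
# Q4 footprints in GB, copied from the 16 GB tier figures in this guide.
Q4_SIZES_GB = {
    "Qwen3.6-27B": 16.8,
    "Qwen 3.5 9B": 5.1,
    "Qwen 3 Coder 14B": 8.3,
    "Llama 3.1 8B": 4.6,
    "DeepSeek R1 Distill 14B": 8.0,
    "Qwen3.6-35B-A3B": 21.0,
}


def fits(model_gb: float, vram_gb: float, overhead_gb: float = 1.5) -> bool:
    """Return True if weights plus an assumed overhead fit in VRAM.

    overhead_gb is a rough, assumed allowance for KV cache and runtime
    buffers; real usage grows with context length.
    """
    return model_gb + overhead_gb <= vram_gb


for name, size in Q4_SIZES_GB.items():
    verdict = "fits" if fits(size, 16.0) else "does not fit fully"
    print(f"{name}: {size} GB -> {verdict} on 16 GB")
```

With these numbers the 16.8 GB Qwen3.6-27B entry lands just over a 16 GB budget even at zero overhead, which suggests it depends on partial CPU offload or a smaller quant on these cards.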

24 GB VRAM — Enthusiast (RTX 4090, RTX 3090, RX 7900 XTX, Mac M4 Pro 24GB)

The 2026 sweet spot. It handles the best MoE models, the top dense 27-32B models, and long 262K contexts.

| Use case | Model | VRAM Q4 | Best quant | tok/s (RTX 4090) |
|---|---|---|---|---|
| Best coding / flagship (NEW) | Qwen3.6-27B dense | 16.8 GB | Q6_K (22.5 GB) | ~60 |
| Best MoE throughput | Qwen3.6-35B-A3B | ~21 GB | Q4_K_M | ~70 |
| Best coding (prev-gen) | Qwen 3 Coder 30B-A3B | ~17 GB | Q4_K_M | ~75 |
| Dense reasoning | Qwen 3 32B | ~19 GB | Q4_K_M | ~55 |
| Prev-gen MoE | Qwen 3.5 35B-A3B | ~21 GB | Q4_K_M | ~70 |
| Math / Code | DeepSeek R1 Distill 32B | ~19 GB | Q4_K_M | ~50 |

Does NOT fit at useful quant: Llama 4 Maverick (requires 128GB), DeepSeek V3 full, Qwen 3.5 122B-A10B (needs 80GB).

Upgrade trigger: If you need Q6/Q8 on 35B-A3B (for coding precision) or 1M-context workflows, jump to 32 GB or Mac 36-64 GB.

32 GB VRAM — High-end consumer (RTX 5090 32GB)

This tier runs Q5/Q6 on 35B-A3B, Llama 4 Scout with partial offload, and Qwen 3.6 at 1M context.

| Use case | Model | VRAM | Best quant | tok/s (RTX 5090) |
|---|---|---|---|---|
| Best coding (NEW) | Qwen3.6-27B dense | ~28.6 GB | Q8_0 | ~85 |
| Best MoE | Qwen3.6-35B-A3B | ~25 GB | Q5_K_M | ~90 |
| Best prev-gen coding | Qwen 3 Coder 30B-A3B | ~20 GB | Q6_K | ~85 |
| Long context | Qwen 3.5 27B (128K) | ~18 GB | Q8_0 | ~55 |
| Dense reasoning | Qwen 3 32B | ~23 GB | Q5_K_M | ~60 |
| Partial offload | Llama 4 Scout 109B | ~28 GB (partial) | Q4 offload | ~15 |
| 1M-context (Qwen 3.6) | Qwen3.6-35B-A3B | ~21-40 GB | Q4-Q5 YaRN | ~90 |

Upgrade trigger: For 35B-A3B at Q8 (effectively FP16 quality) or multi-model concurrent use, go to 48 GB+ (RTX A6000, Mac M4 Max 64GB).
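
The Qwen3.6-27B sizes quoted across the three tiers (16.8 GB at Q4_K_M, 22.5 GB at Q6_K, ~28.6 GB at Q8_0) imply an effective bits-per-weight figure for each quant, which is handy for estimating other models. Below is a rough Python sketch, assuming exactly 27 billion parameters and 1 GB = 10^9 bytes. `bits_per_weight` and `estimate_gb` are hypothetical helpers, and the result covers weights only, not KV cache or runtime overhead.

```python
PARAMS_27B = 27e9  # assumed parameter count for Qwen3.6-27B

# Quantized sizes in GB as quoted in this guide.
QUANT_SIZES_GB = {"Q4_K_M": 16.8, "Q6_K": 22.5, "Q8_0": 28.6}


def bits_per_weight(size_gb: float, params: float) -> float:
    """Effective bits per parameter implied by a quantized size."""
    return size_gb * 1e9 * 8 / params


def estimate_gb(params: float, bpw: float) -> float:
    """Approximate weight size in GB for a given bits-per-weight."""
    return params * bpw / 8 / 1e9


for quant, size in QUANT_SIZES_GB.items():
    print(f"{quant}: {bits_per_weight(size, PARAMS_27B):.2f} bits/weight")

# Apply the Q4_K_M ratio to a hypothetical 32B dense model.
q4_bpw = bits_per_weight(16.8, PARAMS_27B)
print(f"32B at Q4_K_M: ~{estimate_gb(32e9, q4_bpw):.1f} GB")
```

The 32B estimate comes out near 19.9 GB, in line with the ~19 GB that the 24 GB tier lists for Qwen 3 32B at Q4_K_M.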

Apple Silicon unified memory equivalents

Unified memory behaves differently: macOS reserves roughly 15-25% for the system. Effective "LLM headroom" per configuration:

| Mac config | Effective LLM RAM | Closest GPU tier |
|---|---|---|
| MacBook Air M4 16GB | ~12 GB | RTX 4060 Ti 16GB |
| MacBook Air M4 24GB | ~19 GB | between 16 and 24 GB tiers |
| MacBook Pro M4 24GB | ~19 GB | between 16 and 24 GB |
| MacBook Pro M4 Pro 24GB | ~20 GB | 24 GB class (higher bandwidth) |
| MacBook Pro M4 Max 36GB | ~30 GB | 32 GB class |
| MacBook Pro M4 Max 48GB | ~40 GB | 32-48 GB class |
| MacBook Pro M4 Max 64GB | ~54 GB | 48 GB class |
| Mac Studio M3 Ultra 96GB | ~80 GB | workstation |
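
The headroom figures follow from the reserve range quoted in this section. Here is a small Python sketch that turns total unified memory into a usable range. It assumes the 15-25% system reserve applies uniformly, whereas the real reserve varies with machine and workload. `llm_headroom_gb` is a hypothetical helper.

```python
def llm_headroom_gb(total_gb: float,
                    reserve_low: float = 0.15,
                    reserve_high: float = 0.25) -> tuple[float, float]:
    """Return (pessimistic, optimistic) GB left for a model."""
    return total_gb * (1 - reserve_high), total_gb * (1 - reserve_low)


for total in (16, 24, 36, 48, 64, 96):
    low, high = llm_headroom_gb(total)
    print(f"{total} GB unified -> {low:.1f} to {high:.1f} GB for the LLM")
```

For a 24 GB machine this gives 18.0 to 20.4 GB, which brackets the ~19-20 GB figures listed for the 24 GB Macs.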

See MacBook Air M4 vs Pro M4 for Local LLMs for the full decision guide.

Expected tokens per second (Q4_K_M)

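| Model | RTX 4060 Ti 16GB | RTX 4090 24GB | RTX 5090 32GB | Mac M4 Pro 24GB | Mac M4 Max 64GB |
|---|---|---|---|---|---|
| Qwen 3 8B | 50 | 85 | 110 | 40 | 45 |
| Qwen 3.5 9B | 50 | 85 | 110 | 40 | 45 |
| Qwen 3 14B | 35 | 55 | 70 | 22 | 28 |
| Qwen 3.5 27B | - | 35 | 45 | 18 | 24 |
| Qwen 3 30B-A3B MoE | - | 70 | 90 | 34 | 42 |
| Qwen 3.5 35B-A3B | - | 65 | 85 | 30 | 40 |
| DeepSeek R1 Distill 32B | - | 45 | 60 | 20 | 26 |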
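
Throughput figures translate into wait times with simple division. Below is a Python sketch using the Qwen 3 14B rates from the table. The 500-token reply length is an arbitrary assumption, and `generation_seconds` is a hypothetical helper that ignores prompt processing time.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to generate num_tokens at a steady decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


# Qwen 3 14B at Q4_K_M, rates taken from the table in this section.
RATES = {"RTX 4060 Ti 16GB": 35, "RTX 4090 24GB": 55, "Mac M4 Pro 24GB": 22}

for gpu, rate in RATES.items():
    print(f"{gpu}: {generation_seconds(500, rate):.1f} s for 500 tokens")
```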

Decision framework

  • You type slowly or mostly chat: 16 GB is fine; stick with Qwen 3.5 9B at Q8.
  • You code all day: 24 GB (RTX 4090/3090) + Qwen 3 Coder 30B-A3B. Best ROI.
  • You want MoE + long context: 32 GB RTX 5090 or Mac M4 Max 36-48 GB.
  • You run a team workstation: 48-64 GB (Mac Studio / MacBook Pro M4 Max) for Q8 on 35B-A3B.
  • You serve an API to multiple users: skip consumer GPUs; go H100 80GB or datacenter multi-GPU (see Multi-GPU LLM Inference Guide).
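
The framework above can be encoded as a simple lookup. Here is a Python sketch. The profile keys and the `recommend` function are hypothetical names chosen for illustration, while the recommendations themselves are taken from the bullets.

```python
RECOMMENDATIONS = {
    "chat": "16 GB; Qwen 3.5 9B at Q8",
    "coding": "24 GB (RTX 4090/3090); Qwen 3 Coder 30B-A3B",
    "moe_long_context": "32 GB RTX 5090 or Mac M4 Max 36-48 GB",
    "team_workstation": "48-64 GB; Q8 on 35B-A3B",
    "multi_user_api": "H100 80GB or datacenter multi-GPU",
}


def recommend(profile: str) -> str:
    """Map a usage profile to the tier suggested in the bullets."""
    try:
        return RECOMMENDATIONS[profile]
    except KeyError:
        valid = ", ".join(sorted(RECOMMENDATIONS))
        raise ValueError(
            f"unknown profile {profile!r}; expected one of: {valid}"
        ) from None


print(recommend("coding"))
```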
