
URL: https://willitrunai.com/calculator

LLM VRAM Calculator: Check If Your GPU Can Run AI Models | Will It Run AI


Will It Run AI · Calculator

Tell us what you own and what you want to do. We will rank the local models that make sense.

Start from your hardware and workload, then get a shortlist based on fit, speed, and runtime support instead of guessing from generic model lists or benchmark screenshots.

Live catalog snapshot: 196 hardware profiles, 374 models, 24 runtimes. Rankings draw on this live catalog rather than a static benchmark list.

Now evaluating: RTX 4070 12GB

Workload: Coding · Runtime: llama.cpp · Operating mode: Balanced

Inputs

Pick the hardware, runtime, and workload you want to test.

Use the detected hardware if it is right, override it if it is not, and rerun the ranking to compare realistic local AI options.

1. Fit

Memory fit and headroom decide whether a model is realistic on the selected hardware.

2. Workload

The score rewards models that match the selected task and penalizes stale or legacy families when newer specialist releases exist.

3. Speed

Decode throughput and TTFT keep the shortlist practical for real usage, not just theoretically possible runs.
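To make the speed criterion concrete, here is a minimal sketch of how TTFT and decode throughput combine into a rough response time. It uses the TTFT and decode figures listed on this page for Qwen 3.5 9B and Codestral Mamba 7B. The 500-token reply length and the simple "TTFT plus tokens divided by throughput" model are assumptions for illustration, not the calculator's own formula.

```python
def response_time_s(ttft_ms: float, decode_tok_s: float, output_tokens: int) -> float:
    """Rough wall-clock seconds: time to first token plus steady-state decode."""
    return ttft_ms / 1000.0 + output_tokens / decode_tok_s

# TTFT and decode values as listed on this page; 500 output tokens is an assumption.
qwen = response_time_s(ttft_ms=2616, decode_tok_s=72.0, output_tokens=500)
mamba = response_time_s(ttft_ms=1976, decode_tok_s=98.0, output_tokens=500)

print(f"Qwen 3.5 9B: {qwen:.1f} s")          # Qwen 3.5 9B: 9.6 s
print(f"Codestral Mamba 7B: {mamba:.1f} s")  # Codestral Mamba 7B: 7.1 s
```

For short replies TTFT dominates, and for long replies decode throughput dominates, which is why both numbers appear on every card.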

Qwen

Alibaba

Qwen 3.5 9B

Frontier · Released Jun 2025 · Hugging Face · Ollama · LM Studio

Why it wins

Qwen 3.5 9B is a specialized fit for Coding. It is a recent-generation family, which helps on current local SOTA workloads. It fits natively with comfortable headroom. Context coverage stays within the requested workload envelope. Known distribution channels: huggingface, ollama, lm-studio.

Capacity: Roomy · Bandwidth: Medium · Stack: Standard

Interactive: Good · Light API: Great · Bottleneck: Balanced

Rank #1 · S · Runs · Measured

Score: 122.0 · Fit status: Runs well

Fit: Runs well with 32K safe context.

Runtime support: native via GGUF on cuda-local.

Runtime: llama.cpp · Artifact: GGUF

All 374 models

Full compatibility grid for RTX 4070 12GB

244 models fit · 9 excellent · 37 great

Columns: Grade · Model · Params · Tasks · Q4 VRAM · Decode · Context · Memory · Fit

Quant: q4-k-m · Decode: 72 tok/s · Safe ctx: 32K · Official ctx: 131K · Support: native · TTFT: 2616 ms

Weights: 5.5 GB

KV cache: 2.2 GB

Backend: cuda-local
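As a rough illustration of the fit check, the sketch below adds the weights and KV cache figures from this card and compares them with the 12 GB of VRAM on the selected RTX 4070. The function name is hypothetical, and the sketch ignores runtime overhead such as compute buffers, so the real headroom is likely somewhat smaller than this estimate.

```python
VRAM_GB = 12.0  # RTX 4070 12GB, the hardware selected on this page

def memory_fit(weights_gb: float, kv_cache_gb: float, vram_gb: float = VRAM_GB):
    """Return (used_gb, headroom_gb, utilization) for weights plus KV cache.

    Assumption: runtime overhead (compute buffers, CUDA context) is not counted.
    """
    used = weights_gb + kv_cache_gb
    return used, vram_gb - used, used / vram_gb

# Qwen 3.5 9B at q4-k-m, using the figures shown on this card.
used, headroom, util = memory_fit(weights_gb=5.5, kv_cache_gb=2.2)
print(f"used {used:.1f} GB, headroom {headroom:.1f} GB, utilization {util:.0%}")
# used 7.7 GB, headroom 4.3 GB, utilization 64%
```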

Current limits

This setup is broadly balanced for this model.

No major red flags

This recommendation has enough memory headroom and acceptable estimated speed for the selected workload.

Best next improvements

Score 122.0 combines workload match, catalog freshness, fit safety, context coverage, artifact choice, memory utilization, throughput, and latency.
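The page names the factors behind the score but not how they are weighted or scaled. The sketch below shows one common way such a composite could be built, a weighted sum of per-factor sub-scores. The weights, the sub-score values, and the function name are all hypothetical, and the result is not meant to reproduce the 122.0 shown above.

```python
from typing import Dict

# Factor names follow the sentence above. The weights are hypothetical:
# the page does not publish how the factors are weighted or scaled.
FACTOR_WEIGHTS: Dict[str, float] = {
    "workload_match": 1.0,
    "catalog_freshness": 1.0,
    "fit_safety": 1.0,
    "context_coverage": 1.0,
    "artifact_choice": 1.0,
    "memory_utilization": 1.0,
    "throughput": 1.0,
    "latency": 1.0,
}

def composite_score(factors: Dict[str, float],
                    weights: Dict[str, float] = FACTOR_WEIGHTS) -> float:
    """Weighted sum of per-factor sub-scores; raises if a factor is missing."""
    missing = sorted(set(weights) - set(factors))
    if missing:
        raise ValueError(f"missing factors: {missing}")
    return sum(weights[name] * factors[name] for name in weights)

# Illustrative sub-scores only, not values taken from the calculator.
example = {name: 10.0 for name in FACTOR_WEIGHTS}
example["workload_match"] = 20.0
print(composite_score(example))
```

A weighted sum keeps each factor's contribution easy to inspect, which matters when a ranking has to explain why one model beats another.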

CodeGeeX

Tsinghua/Zhipu

CodeGeeX 4 9B

Current · Released Jul 2024 · Hugging Face · Ollama

Why it wins

CodeGeeX 4 9B is a specialized fit for Coding. It sits in the middle of the current generation mix. It fits natively with comfortable headroom. Context coverage stays within the requested workload envelope. Known distribution channels: huggingface, ollama.

Capacity: Roomy · Bandwidth: Medium · Stack: Standard

Interactive: Good · Light API: Great · Bottleneck: Balanced

Rank #2 · A · Runs · Estimated

Score: 114.6 · Fit status: Runs well

Fit: Runs well with 116K safe context.

Runtime support: native via GGUF on cpu-gpu-local.

Runtime: llama.cpp · Artifact: GGUF

Quant: q4-k-m · Decode: 75.3 tok/s · Safe ctx: 116K · Official ctx: 131K · Support: native · TTFT: 2571 ms

Weights: 5.5 GB

KV cache: 0.6 GB

Backend: cpu-gpu-local

Current limits

This setup is broadly balanced for this model.

No major red flags

This recommendation has enough memory headroom and acceptable estimated speed for the selected workload.

Best next improvements

Score 114.6 combines workload match, catalog freshness, fit safety, context coverage, artifact choice, memory utilization, throughput, and latency.

Gemma

Google

Gemma 4 E4B

Frontier · Released Apr 2026 · Hugging Face · Ollama · LM Studio

Why it wins

Gemma 4 E4B is a specialized fit for Coding. It is a recent-generation family, which helps on current local SOTA workloads. It fits natively with comfortable headroom. Context coverage stays within the requested workload envelope. Known distribution channels: huggingface, ollama, lm-studio.

Capacity: Roomy · Bandwidth: Medium · Stack: Standard

Interactive: Good · Light API: Great · Bottleneck: Balanced

Rank #3 · A · Runs · Estimated

Score: 110.2 · Fit status: Runs well

Fit: Runs well with 63K safe context.

Runtime support: native via GGUF on cuda-local.

Runtime: llama.cpp · Artifact: GGUF

Quant: q4-k-m · Decode: 63.1 tok/s · Safe ctx: 63K · Official ctx: 128K · Support: native · TTFT: 3068 ms

Weights: 4.9 GB

KV cache: 1.3 GB

Backend: cuda-local

Current limits

This setup is broadly balanced for this model.

No major red flags

This recommendation has enough memory headroom and acceptable estimated speed for the selected workload.

Best next improvements

Score 110.2 combines workload match, catalog freshness, fit safety, context coverage, artifact choice, memory utilization, throughput, and latency.

Codestral

Mistral AI

Codestral Mamba 7B

Current · Released Jul 2024 · Hugging Face · Ollama

Why it wins

Codestral Mamba 7B is a specialized fit for Coding. It sits in the middle of the current generation mix. It fits natively with comfortable headroom. Context coverage stays within the requested workload envelope. Known distribution channels: huggingface, ollama.

Capacity: Roomy · Bandwidth: Medium · Stack: Standard

Interactive: Good · Light API: Great · Bottleneck: Balanced

Rank #4 · A · Runs · Estimated

Score: 107.2 · Fit status: Runs well

Fit: Runs well with 184K safe context.

Runtime support: native via GGUF on cpu-gpu-local.

Runtime: llama.cpp · Artifact: GGUF

Quant: q4-k-m · Decode: 98 tok/s · Safe ctx: 184K · Official ctx: 262K · Support: native · TTFT: 1976 ms

Weights: 4.3 GB

KV cache: 0.5 GB

Backend: cpu-gpu-local

Current limits

This setup is broadly balanced for this model.

No major red flags

This recommendation has enough memory headroom and acceptable estimated speed for the selected workload.

Best next improvements

Score 107.2 combines workload match, catalog freshness, fit safety, context coverage, artifact choice, memory utilization, throughput, and latency.

Yi

01.AI

Yi Coder 9B

Current · Released Sep 2024 · Hugging Face · Ollama · LM Studio

Why it wins

Yi Coder 9B is a specialized fit for Coding. It sits in the middle of the current generation mix. It fits natively with comfortable headroom. Context coverage stays within the requested workload envelope. Known distribution channels: huggingface, ollama, lm-studio.

Capacity: Roomy · Bandwidth: Medium · Stack: Standard

Interactive: Good · Light API: Great · Bottleneck: Balanced

Rank #5 · B · Runs · Estimated

Score: 106.6 · Fit status: Runs well

Fit: Runs well with 48K safe context.

Runtime support: native via GGUF on cpu-gpu-local.

Runtime: llama.cpp · Artifact: GGUF

Quant: q4-k-m · Decode: 74.9 tok/s · Safe ctx: 48K · Official ctx: 131K · Support: native · TTFT: 2586 ms

Weights: 5.5 GB

KV cache: 1.5 GB

Backend: cpu-gpu-local

Current limits

This setup is broadly balanced for this model.

No major red flags

This recommendation has enough memory headroom and acceptable estimated speed for the selected workload.

Best next improvements

Score 106.6 combines workload match, catalog freshness, fit safety, context coverage, artifact choice, memory utilization, throughput, and latency.

Granite

IBM

Granite 4.1 8B

Current · Released Apr 2026 · Hugging Face · Ollama

Why it wins

Granite 4.1 8B is a specialized fit for Coding. It sits in the middle of the current generation mix. It fits natively with comfortable headroom. Context coverage stays within the requested workload envelope. Known distribution channels: huggingface, ollama.

Capacity: Roomy · Bandwidth: Medium · Stack: Standard

Interactive: Good · Light API: Great · Bottleneck: Balanced

Rank #6 · A · Runs · Estimated

Score: 102.3 · Fit status: Runs well

Fit: Runs well with 33K safe context.

Runtime support: native via GGUF on cuda-local.

Runtime: llama.cpp · Artifact: GGUF

Quant: q4-k-m · Decode: 83.3 tok/s · Safe ctx: 33K · Official ctx: 131K · Support: native · TTFT: 2325 ms

Weights: 4.9 GB

KV cache: 2.4 GB

Backend: cuda-local

Current limits

This setup is broadly balanced for this model.

No major red flags

This recommendation has enough memory headroom and acceptable estimated speed for the selected workload.

Best next improvements

Score 102.3 combines workload match, catalog freshness, fit safety, context coverage, artifact choice, memory utilization, throughput, and latency.

Qwen

Alibaba

Qwen 2.5 Coder 7B

Current · Released Sep 2024 · Hugging Face · Ollama · LM Studio

Why it wins

Qwen 2.5 Coder 7B is a specialized fit for Coding. It sits in the middle of the current generation mix. It fits natively with comfortable headroom. Context coverage stays within the requested workload envelope. Known distribution channels: huggingface, ollama, lm-studio.

Capacity: Roomy · Bandwidth: Medium · Stack: Standard

Interactive: Good · Light API: Great · Bottleneck: Balanced

Rank #7 · A · Runs · Estimated

Score: 101.0 · Fit status: Runs well

Fit: Runs well with 105K safe context.

Runtime support: native via GGUF on cpu-gpu-local.

Runtime: llama.cpp · Artifact: GGUF

Quant: q4-k-m · Decode: 96.1 tok/s · Safe ctx: 105K · Official ctx: 131K · Support: native · TTFT: 2014 ms

Weights: 4.3 GB

KV cache: 0.9 GB

Backend: cpu-gpu-local

Current limits

This setup is broadly balanced for this model.

No major red flags

This recommendation has enough memory headroom and acceptable estimated speed for the selected workload.

Best next improvements

Score 101.0 combines workload match, catalog freshness, fit safety, context coverage, artifact choice, memory utilization, throughput, and latency.

Qwen

Alibaba

Qwen 3 8B

Frontier · Released Apr 2025 · Hugging Face · Ollama · LM Studio

Why it wins

Qwen 3 8B is viable for Coding, but is not the most specialized choice. It is a recent-generation family, which helps on current local SOTA workloads. It fits natively with comfortable headroom. Context coverage stays within the requested workload envelope. Known distribution channels: huggingface, ollama, lm-studio.

Capacity: Roomy · Bandwidth: Medium · Stack: Standard

Interactive: Good · Light API: Great · Bottleneck: Balanced

Rank #8 · S · Runs · Estimated

Score: 99.6 · Fit status: Runs well

Fit: Runs well with 37K safe context.

Runtime support: native via GGUF on cuda-local.

Runtime: llama.cpp · Artifact: GGUF

Quant: q4-k-m · Decode: 83.3 tok/s · Safe ctx: 37K · Official ctx: 131K · Support: native · TTFT: 2325 ms

Weights: 4.9 GB

KV cache: 2.2 GB

Backend: cuda-local

Current limits

This setup is broadly balanced for this model.

No major red flags

This recommendation has enough memory headroom and acceptable estimated speed for the selected workload.

Best next improvements

Score 99.6 combines workload match, catalog freshness, fit safety, context coverage, artifact choice, memory utilization, throughput, and latency.

Nemotron

NVIDIA

Nemotron Nano 9B v2

Frontier · Released Jun 2025 · Hugging Face · Ollama · LM Studio

Why it wins

Nemotron Nano 9B v2 is a specialized fit for Coding. It is a recent-generation family, which helps on current local SOTA workloads. It should run, but memory headroom will be limited. Context coverage stays within the requested workload envelope. Known distribution channels: huggingface, ollama, lm-studio.

Capacity: Tight · Bandwidth: Medium · Stack: Standard

Interactive: Good · Light API: Good · Bottleneck: Balanced

Rank #9 · A · Tight · Estimated

Score: 99.4 · Fit status: Tight fit

Fit: Tight fit with 29K safe context.

Runtime support: native via GGUF on cpu-gpu-local.

Runtime: llama.cpp · Artifact: GGUF

Quant: q4-k-m · Decode: 74 tok/s · Safe ctx: 29K · Official ctx: 131K · Support: native · TTFT: 2616 ms

Weights: 5.5 GB

KV cache: 2.4 GB

Backend: cpu-gpu-local

Current limits

This setup is broadly balanced for this model.

No major red flags

This recommendation has enough memory headroom and acceptable estimated speed for the selected workload.

Best next improvements

Score 99.4 combines workload match, catalog freshness, fit safety, context coverage, artifact choice, memory utilization, throughput, and latency.

Qwen

Alibaba

Qwen 3.5 4B

Frontier · Released Jun 2025 · Hugging Face · Ollama · LM Studio

Why it wins

Qwen 3.5 4B is a specialized fit for Coding. It is a recent-generation family, which helps on current local SOTA workloads. It fits natively with comfortable headroom. Context coverage stays within the requested workload envelope. Known distribution channels: huggingface, ollama, lm-studio.

Capacity: Roomy · Bandwidth: Medium · Stack: Standard

Interactive: Good · Light API: Great · Bottleneck: Balanced

Rank #10 · S · Runs · Estimated

Score: 93.6 · Fit status: Runs well

Fit: Runs well with 48K safe context.

Runtime support: native via GGUF on cuda-local.

Runtime: llama.cpp · Artifact: GGUF

Quant: q6-k · Decode: 56 tok/s · Safe ctx: 48K · Official ctx: 131K · Support: native · TTFT: 3457 ms

Weights: 3.3 GB

KV cache: 2.2 GB

Backend: cuda-local
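This card is the only one on the page listed at q6-k rather than q4-k-m, so it is a useful place to see what the quant choice means for weight size. The sketch below back-calculates the bits per weight implied by the weight sizes shown on this page. It assumes the nominal parameter counts in the model names (actual counts can differ) and decimal gigabytes (1 GB = 1e9 bytes). The function name is hypothetical.

```python
def implied_bits_per_weight(weights_gb: float, params_billion: float) -> float:
    """Bits per parameter implied by a weight size, using 1 GB = 1e9 bytes."""
    return weights_gb * 1e9 * 8 / (params_billion * 1e9)

# (model, quant, weights in GB as shown on this page, nominal params in billions)
cards = [
    ("Qwen 3.5 9B", "q4-k-m", 5.5, 9.0),
    ("Qwen 2.5 Coder 7B", "q4-k-m", 4.3, 7.0),
    ("Qwen 3 8B", "q4-k-m", 4.9, 8.0),
    ("Qwen 3.5 4B", "q6-k", 3.3, 4.0),
]

for name, quant, gb, billions in cards:
    bpw = implied_bits_per_weight(gb, billions)
    print(f"{name} ({quant}): {bpw:.2f} bits/weight")
# Qwen 3.5 9B (q4-k-m): 4.89 bits/weight
# Qwen 2.5 Coder 7B (q4-k-m): 4.91 bits/weight
# Qwen 3 8B (q4-k-m): 4.90 bits/weight
# Qwen 3.5 4B (q6-k): 6.60 bits/weight
```

Under these assumptions the q4-k-m cards cluster near 4.9 bits per weight and the q6-k card sits near 6.6, which is why a 4B model at q6-k still needs 3.3 GB for weights.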

Current limits

This setup is broadly balanced for this model.

No major red flags

This recommendation has enough memory headroom and acceptable estimated speed for the selected workload.

Best next improvements

Score 93.6 combines workload match, catalog freshness, fit safety, context coverage, artifact choice, memory utilization, throughput, and latency.