
URL: https://apxml.com/tools/vram-calculator

Can You Run This LLM? VRAM Calculator (Nvidia GPU and Apple Silicon)


LLM Inference: VRAM & Performance Calculator

Precision of the model weights during inference. Lower precision uses less VRAM but may reduce output quality.
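To see why lower weight precision saves VRAM, note that weight memory is roughly the parameter count times the bytes stored per parameter. The sketch below is not necessarily how this calculator computes its figures. It assumes a hypothetical 7B-parameter model and an assumed table of formats (FP16, INT8, INT4), and it ignores quantization metadata such as scales.

```python
# Assumed bytes per parameter for common weight formats; the calculator's own
# precision options may differ.
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16": 2.0,
    "bf16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate VRAM for model weights only, in decimal GB (1e9 bytes).

    Ignores quantization metadata (scales, zero-points), which adds a small
    overhead for int8/int4 formats in practice.
    """
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    params = 7e9  # hypothetical 7B-parameter model
    for prec in ("fp16", "int8", "int4"):
        print(f"{prec}: {weight_memory_gb(params, prec):.1f} GB")
```

This prints 14.0 GB for FP16, 7.0 GB for INT8, and 3.5 GB for INT4, before any KV cache or runtime overhead is added.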

Precision of the KV (key/value) cache. Lower precision reduces VRAM use, and the savings grow with sequence length.
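The KV cache holds one key and one value vector per layer, per KV head, and per token. Its size therefore scales linearly with both bytes per element and sequence length, which is why lower precision saves the most on long inputs. This is a minimal sketch. It assumes a hypothetical 32-layer model with 32 KV heads of dimension 128 and compares an FP16 cache (2 bytes) with an 8-bit cache (1 byte). The calculator's own formula may include additional terms.

```python
GIB = 1024 ** 3


def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int,
    bytes_per_elem: float,
) -> float:
    """KV cache size: a key and a value vector (hence the factor of 2) for every
    layer, KV head, token, and sequence in the batch."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical 7B-class configuration with standard multi-head attention.
    cfg = dict(num_layers=32, num_kv_heads=32, head_dim=128, batch_size=1)
    for seq_len in (1024, 8192, 32768):
        fp16 = kv_cache_bytes(seq_len=seq_len, bytes_per_elem=2, **cfg) / GIB
        int8 = kv_cache_bytes(seq_len=seq_len, bytes_per_elem=1, **cfg) / GIB
        print(f"{seq_len:>6} tokens: FP16 {fp16:.2f} GiB, 8-bit {int8:.2f} GiB, "
              f"saved {fp16 - int8:.2f} GiB")
```

Under these assumptions the saving grows from 0.25 GiB at 1,024 tokens to 8 GiB at 32,768 tokens.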

Hardware Configuration

Select your GPU or set custom VRAM

Number of devices used for parallel inference
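A simple first-order model for multi-device inference splits the weights and KV cache evenly across devices, which is roughly what tensor parallelism does. The sketch assumes that even split and a hypothetical 10% overhead for activations, runtime buffers, and fragmentation. Real frameworks also need communication buffers, and pipeline parallelism splits by layer instead, so actual per-device usage can be uneven.

```python
def per_device_memory_gb(
    weights_gb: float,
    kv_cache_gb: float,
    num_devices: int,
    overhead_fraction: float = 0.10,
) -> float:
    """Estimated VRAM per device, assuming weights and KV cache are split evenly
    (roughly what tensor parallelism does) plus a hypothetical overhead fraction
    for activations, runtime buffers, and fragmentation."""
    if num_devices < 1:
        raise ValueError("num_devices must be at least 1")
    return (weights_gb + kv_cache_gb) / num_devices * (1.0 + overhead_fraction)


def fits(weights_gb: float, kv_cache_gb: float, num_devices: int,
         vram_per_device_gb: float) -> bool:
    """True if the estimated per-device usage fits in each device's VRAM."""
    return per_device_memory_gb(weights_gb, kv_cache_gb, num_devices) <= vram_per_device_gb


if __name__ == "__main__":
    # Hypothetical 14B-parameter model in FP16 (28 GB) with 1 GB of KV cache,
    # on 16 GB devices.
    for n in (1, 2):
        need = per_device_memory_gb(28.0, 1.0, n)
        print(f"{n} device(s): {need:.2f} GB each, fits: {fits(28.0, 1.0, n, 16.0)}")
```

With these numbers one 16 GB device is not enough (about 31.9 GB needed), while two devices need about 15.95 GB each and fit.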

Input Parameters

Batch Size: 1

Number of inputs processed simultaneously in each step (affects throughput and latency).
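Batch size trades per-sequence speed for total throughput. A common rough model treats each token-generation step as memory-bandwidth-bound. The weights are read once per step and shared by the whole batch, while each sequence's KV cache is read separately. The sketch assumes a hypothetical 1,000 GB/s of memory bandwidth, 14 GB of FP16 weights, and 0.5 GB of KV cache per sequence. It ignores compute limits, which start to dominate at large batch sizes.

```python
def decode_speed(
    weights_gb: float,
    kv_gb_per_seq: float,
    batch_size: int,
    bandwidth_gb_s: float,
) -> tuple[float, float]:
    """Rough memory-bandwidth-bound decode estimate.

    Each step reads the weights once (shared across the batch) plus every
    sequence's KV cache. Returns (tokens/s per sequence, total tokens/s).
    Compute limits are ignored, so large batches are overestimated.
    """
    gb_read_per_step = weights_gb + batch_size * kv_gb_per_seq
    step_time_s = gb_read_per_step / bandwidth_gb_s
    per_sequence = 1.0 / step_time_s
    return per_sequence, per_sequence * batch_size


if __name__ == "__main__":
    # Hypothetical: 14 GB FP16 weights, 0.5 GB KV cache per sequence, 1,000 GB/s.
    for batch in (1, 2, 4, 8):
        per_seq, total = decode_speed(14.0, 0.5, batch, 1000.0)
        print(f"batch {batch}: {per_seq:.1f} tok/s per sequence, {total:.1f} tok/s total")
```

Going from batch 1 to batch 8, total throughput rises from about 69 to about 444 tokens/s, while each sequence slows from about 69 to about 56 tokens/s.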

Sequence Length: 1,024

Maximum number of tokens per input. It determines the size of the KV cache (which also depends on the model's attention structure, e.g., multi-head vs. grouped-query attention) and of the activations.
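The KV cache grows linearly with sequence length, so at long contexts it can exceed the size of the weights. Attention structure also matters. With grouped-query attention (GQA), several query heads share one KV head, which shrinks the cache proportionally. The sketch compares standard multi-head attention with a hypothetical GQA variant that uses 8 KV heads. It assumes a 32-layer model with head dimension 128 and an FP16 cache.

```python
GIB = 1024 ** 3


def kv_cache_gib(num_layers: int, num_kv_heads: int, head_dim: int,
                 seq_len: int, bytes_per_elem: int = 2, batch_size: int = 1) -> float:
    """KV cache size in GiB (factor 2 = one key and one value per token)."""
    total = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem
    return total / GIB


if __name__ == "__main__":
    # Hypothetical 32-layer model with 32 query heads of dimension 128, FP16 cache.
    for seq_len in (8192, 32768, 131072):
        mha = kv_cache_gib(32, 32, 128, seq_len)  # multi-head: one KV head per query head
        gqa = kv_cache_gib(32, 8, 128, seq_len)   # GQA: 8 KV heads shared by 32 query heads
        print(f"{seq_len:>7} tokens: MHA {mha:.0f} GiB, GQA {gqa:.0f} GiB")
```

At 131,072 tokens this gives 64 GiB for multi-head attention versus 16 GiB for the 8-KV-head GQA variant.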

Concurrent Users: 1

Number of users running inference simultaneously (affects memory usage and per-user performance).
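If each concurrent user holds their own KV cache, the number of users a GPU can serve is bounded by the VRAM left after the weights are loaded. This is a minimal sketch under that assumption. It uses a hypothetical 1 GB reserve for activations and runtime overhead. Serving systems that allocate cache memory on demand, instead of reserving the full sequence length for every user, can often fit more users.

```python
def max_concurrent_users(
    vram_gb: float,
    weights_gb: float,
    kv_gb_per_user: float,
    reserve_gb: float = 1.0,
) -> int:
    """Upper bound on users whose KV caches fit after loading the weights.

    Assumes each user reserves a full-length KV cache of kv_gb_per_user and
    that reserve_gb (a hypothetical figure) covers activations and runtime
    overhead.
    """
    if kv_gb_per_user <= 0:
        raise ValueError("kv_gb_per_user must be positive")
    free_gb = vram_gb - weights_gb - reserve_gb
    if free_gb <= 0:
        return 0
    return int(free_gb // kv_gb_per_user)


if __name__ == "__main__":
    # Hypothetical 16 GB GPU with a 0.5 GB KV cache per user.
    print(max_concurrent_users(16.0, 14.0, 0.5))  # FP16 weights of a 7B model
    print(max_concurrent_users(16.0, 3.5, 0.5))   # 4-bit weights of the same model
```

Under these assumptions, FP16 weights leave room for 2 users, while 4-bit weights leave room for 23.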

Inference Simulation

Configuration: FP16 weights / FP16 KV cache on a 16 GB custom GPU

Input sequence length: 1,024 tokens

Select a model and hardware configuration to enable the simulation.

Performance & Memory Results

Reported metrics: VRAM usage (in GB and as a percentage of available VRAM), generation speed, time to first token, total throughput, and estimated GPU rental cost (N/A for local-only hardware).

Mode: Inference | Batch: 1
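Time to first token is dominated by prefill. Prefill processes the whole prompt at once and is usually compute-bound. A common approximation for a dense transformer forward pass is about 2 FLOPs per parameter per token. This ignores the attention term that grows with prompt length. The sketch assumes a hypothetical 7B-parameter model, a 1,024-token prompt, a GPU with 100 TFLOPS of peak throughput, and 50% utilization. None of these figures come from the calculator.

```python
def time_to_first_token_ms(
    num_params: float,
    prompt_tokens: int,
    peak_tflops: float,
    utilization: float = 0.5,
) -> float:
    """Compute-bound prefill estimate: ~2 FLOPs per parameter per prompt token.

    Ignores the attention term that grows with prompt length, so long prompts
    are underestimated. utilization is the assumed fraction of peak achieved.
    """
    prefill_flops = 2.0 * num_params * prompt_tokens
    achieved_flops_per_s = peak_tflops * 1e12 * utilization
    return prefill_flops / achieved_flops_per_s * 1000.0


if __name__ == "__main__":
    # Hypothetical: 7B parameters, 1,024-token prompt, 100 TFLOPS peak, 50% utilization.
    print(f"{time_to_first_token_ms(7e9, 1024, 100.0):.0f} ms")
```

With these inputs the estimate is about 287 ms. Doubling the prompt length roughly doubles it, because the FLOP count is linear in prompt tokens under this approximation.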