Voozh

LLM Inference: VRAM & Performance Calculator

Precision for model weights during inference. Lower precision uses less VRAM but may reduce output quality.
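As a rough illustration of why weight precision matters, the sketch below estimates VRAM for the weights alone as parameter count times bytes per parameter. This is a common back-of-the-envelope rule, not necessarily the formula this calculator uses; the function name `weight_vram_gib` and the precision table are assumptions made for the example.

```python
# Bytes per parameter for common weight precisions (standard storage sizes).
# Real quantized formats add a little overhead for scales; ignored here.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_vram_gib(num_params: float, precision: str) -> float:
    """Approximate VRAM needed for model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[precision] / 1024**3


# Hypothetical 7-billion-parameter model at two precisions.
print(round(weight_vram_gib(7e9, "fp16"), 2))
print(round(weight_vram_gib(7e9, "int4"), 2))
```

Under these assumptions, halving the bytes per parameter halves the weight footprint, which is why lower precision can let a model fit on a smaller GPU.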

KV cache precision. Lower precision reduces VRAM use, especially for long sequences.
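A minimal sketch of why the KV cache grows with sequence length and shrinks with lower precision: for a standard transformer, each layer stores one key and one value vector per token per KV head. The function name `kv_cache_gib` and the example model shape (32 layers, 32 KV heads, head dimension 128) are assumptions for illustration, not values taken from this page.

```python
def kv_cache_gib(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_value: float = 2.0,  # 2.0 for FP16, 1.0 for an 8-bit cache
) -> float:
    """Approximate KV cache size in GiB for a standard attention layout."""
    # Factor of 2 accounts for storing both keys and values.
    total_bytes = (
        2 * num_layers * num_kv_heads * head_dim
        * seq_len * batch_size * bytes_per_value
    )
    return total_bytes / 1024**3


# Hypothetical model shape at 1,024 tokens, FP16 cache.
print(kv_cache_gib(32, 32, 128, seq_len=1024))
# Same shape with an 8-bit cache and a much longer sequence.
print(kv_cache_gib(32, 32, 128, seq_len=131072, bytes_per_value=1.0))
```

Models that share KV heads across query heads (grouped-query or multi-query attention) have a smaller `num_kv_heads`, so their cache is proportionally smaller for the same sequence length.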

Hardware Configuration

Select your GPU or set custom VRAM

Number of devices used for parallel inference

Input Parameters

Batch Size:

Inputs processed simultaneously per step (affects throughput & latency)

Sequence Length: 1,024

Maximum tokens per input. Affects the size of the KV cache (which also depends on the attention structure) and of activations.

Slider marks: 16K, 33K, 66K, 131K

Concurrent Users:

Number of users running inference simultaneously (affects memory usage and per-user performance)
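To illustrate how concurrent users affect memory, the sketch below assumes each user needs a separate KV cache while the weights are shared, and estimates how many users fit in the remaining VRAM. This is a simplification (it ignores activations and framework overhead unless passed as `reserve_gib`), and the function name and example figures are hypothetical rather than taken from this calculator.

```python
import math


def max_concurrent_users(
    vram_gib: float,
    weights_gib: float,
    kv_per_user_gib: float,
    reserve_gib: float = 0.0,  # assumed headroom for activations and overhead
) -> int:
    """Estimate how many per-user KV caches fit beside the shared weights."""
    free_gib = vram_gib - weights_gib - reserve_gib
    if free_gib <= 0 or kv_per_user_gib <= 0:
        return 0
    return math.floor(free_gib / kv_per_user_gib)


# Hypothetical case: 16 GiB GPU, ~13.04 GiB of FP16 weights,
# 0.5 GiB of KV cache per user at 1,024 tokens.
print(max_concurrent_users(16.0, 13.04, 0.5))
```

Under these assumptions, lowering weight or KV cache precision frees memory that can be spent on more users or longer sequences.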

(FP16 Weights / FP16 KV Cache) on 16GB Custom GPU

Input sequence length: 1,024 tokens
