Precision for model weights during inference. Lower precision uses less VRAM but may reduce output quality.
KV cache precision. Lower precision reduces VRAM use, especially for long sequences.
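As a rough illustration of why weight precision drives VRAM use, the sketch below estimates weight memory as parameter count times bytes per parameter. The function name, the precision table, and the 7-billion-parameter example are assumptions made for illustration; this is not the calculator's own formula, and it ignores quantization overhead such as scales and zero points.

```python
# Hypothetical helper: bytes stored per parameter for common precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    # Assumed example: a model with 7 billion parameters.
    for precision in ("fp16", "int8", "int4"):
        print(precision, weight_memory_gb(7e9, precision), "GB")
```

Halving the bytes per parameter halves the estimate, which is the trade-off the precision setting exposes.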
Hardware Configuration
Select your GPU or set custom VRAM
Devices for parallel inference
Input Parameters
Batch Size: 1
Inputs processed simultaneously per step (affects throughput & latency)
Sequence Length: 1,024
Maximum tokens per input. Affects KV cache size (which also depends on the attention structure) and activation memory.
Concurrent Users:
Number of users running inference simultaneously (affects memory usage and per-user performance)
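To make the link between these inputs and KV cache size concrete, here is a minimal sketch using the commonly cited estimate for standard transformer attention: one key tensor and one value tensor per layer, each holding batch × sequence length × KV heads × head dimension values. The function name and the model shape in the example (32 layers, head dimension 128) are assumptions, not values from this tool, and the sketch does not model concurrent users or activations.

```python
def kv_cache_gb(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int,
    bytes_per_value: float = 2.0,  # 2.0 assumes an FP16 KV cache
) -> float:
    """Approximate KV cache size in GB (1 GB = 1e9 bytes)."""
    # Factor of 2: one key tensor plus one value tensor per layer.
    total_bytes = (
        2 * num_layers * num_kv_heads * head_dim
        * seq_len * batch_size * bytes_per_value
    )
    return total_bytes / 1e9


if __name__ == "__main__":
    # Assumed model shape, with sequence length 1,024 and batch size 1.
    full = kv_cache_gb(32, 32, 128, seq_len=1024, batch_size=1)
    # Same shape with 8 KV heads (grouped-query attention).
    grouped = kv_cache_gb(32, 8, 128, seq_len=1024, batch_size=1)
    print(f"32 KV heads: {full:.3f} GB, 8 KV heads: {grouped:.3f} GB")
```

The estimate grows linearly with both sequence length and batch size, and the KV head count shows how the attention structure changes the result for the same sequence length.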
(FP16 Weights / FP16 KV Cache) on 16GB Custom GPU
Input sequence length: 1,024 tokens
Configure model and hardware to enable simulation
©2025 ApX Machine Learning
VRAM: 0 GB of 12 GB (0.0%)
Generation Speed: ...
Time to First Token: ~0ms
Total Throughput: ...
Est. GPU Rental: N/A (Local Only)
Mode: Inference | Batch: 1