VOOZH about

URL: https://huggingface.co/mlx-community/gemma-4-12B-it-qat-OptiQ-4bit

⇱ mlx-community/gemma-4-12B-it-qat-OptiQ-4bit · Hugging Face


mlx-community/gemma-4-12B-it-qat-OptiQ-4bit

Built with mlx-optiq, the MLX-native toolkit to quantize, fine-tune, and serve LLMs locally on Apple Silicon, no PyTorch and no cloud. Try the Lab · All OptIQ quants · Docs

A 4-bit mixed-precision MLX quant produced by mlx-optiq, built on Google's quantization-aware-trained (QAT) Gemma-4 base. OptIQ's sensitivity-guided per-layer bit allocation is applied on top of weights that were trained to survive low-bit quantization, and it still beats a uniform 4-bit quant of the same QAT base by +1.37 Capability Score points.

This is a quant of google/gemma-4-12B-it-qat-q4_0-unquantized. Per-layer bit-widths come from a KL-divergence sensitivity pass on a six-domain calibration mix (prose, reasoning, code, agent, tool-call, constraint-bearing instructions). Sensitive layers go to 8-bit, robust ones stay at 4-bit.

Quantization details

Property Value
Base google/gemma-4-12B-it-qat-q4_0-unquantized (QAT)
Predominant precision 4-bit
Components at 8-bit (sensitive) 157
Components at 4-bit (robust) 171
Total quantized components 328
Achieved bits-per-weight 5.25
Group size 64
Reference for sensitivity uniform 4-bit (streamed)
Calibration mix six-domain mix
Vision bf16 sidecar (optiq_vision.safetensors), image+text via optiq
Speculative drafter google/gemma-4-12B-it-qat-q4_0-unquantized-assistant via optiq serve --drafter

Capability Score

Six-metric mean (MMLU, GSM8K, IFEval, BFCL, HumanEval, HashHop), scored against a uniform 4-bit quant of the same QAT base. That comparison isolates what the mixed-precision allocation adds, holding the base fixed.

Benchmark Uniform-4 (QAT base) This model (OptIQ, QAT base) Delta
MMLU (5-shot, 1000) 50.9% 52.5% +1.6
GSM8K (1000) 93.1% 93.3% +0.2
IFEval (full, strict) 72.3% 73.6% +1.3
BFCL-V3 simple (200) 72.5% 72.0% -0.5
HumanEval (pass@1, 164) 90.9% 91.5% +0.6
HashHop (long-context) 30.0% 35.0% +5.0
Capability Score (mean) 68.27 69.64 +1.37

OptIQ adds +1.37 points over uniform 4-bit on this QAT base, consistent with the margin on the other QAT Gemma-4 sizes (E2B +2.09, E4B +1.19): the per-layer allocation keeps paying off even after QAT has made the weights more quantization-robust. The mixed quant is 5.25 bits-per-weight (about 8.3 GB on disk) versus 4.0 bits-per-weight (about 6.2 GB) for uniform 4-bit, with the extra budget spent on the layers that need it.

Usage

The 12B is the unified Gemma-4 (model_type: gemma4_unified), so it needs mlx-lm from main and import optiq (the unified text tower is not in the 0.31.3 PyPI release; the main build also reports 0.31.3, so install from git, not a version pin):

pip install -U mlx-optiq "mlx-lm @ git+https://github.com/ml-explore/mlx-lm.git"
import optiq # registers the gemma4_unified model type
from mlx_lm import load, generate
model, tokenizer = load("mlx-community/gemma-4-12B-it-qat-OptiQ-4bit")
print(generate(model, tokenizer, "Explain mixed-precision quantization.", max_tokens=256))

Image+text input and the speculative drafter run through mlx-optiq:

pip install mlx-optiq
optiq serve --model mlx-community/gemma-4-12B-it-qat-OptiQ-4bit \
 --drafter google/gemma-4-12B-it-qat-q4_0-unquantized-assistant

The language and image+text paths both run through optiq. The bf16 vision tower rides in optiq_vision.safetensors, which mlx-lm ignores (it globs model*.safetensors), so both paths work from one artifact.

Downloads last month
1,073
Safetensors
Model size
12B params
Tensor type
BF16
·
U32
·
MLX
Hardware compatibility
Log In to add your hardware

4-bit

Model tree for mlx-community/gemma-4-12B-it-qat-OptiQ-4bit

Quantized
(30)
this model