gemma-4-12B-it-qat-GGUF
gemma-4-12B-it-qat-q4_0-unquantized is a 12-billion-parameter, instruction-tuned vision-language model from Google DeepMind and part of the Gemma 4 family. It was optimized with Quantization-Aware Training (QAT) to keep quality close to bfloat16 while significantly reducing memory requirements. Its unified, encoder-free architecture projects raw image patches and audio waveforms directly into the LLM's embedding space. The model supports text, image, and audio input and offers a 256K-token context window.
Its attention stack is hybrid, interleaving local sliding-window attention layers with full global attention layers. The model supports more than 140 languages, native function calling, and a configurable thinking (reasoning) mode. Its reported benchmark scores include 77.2% on MMLU Pro, 78.8% on GPQA Diamond, and 77.5% on AIME 2026.
The "q4_0-unquantized" variant holds the half-precision weights produced by the Q4_0 QAT pipeline before any quantization is applied. It is therefore intended for custom downstream compilation and research rather than for direct deployment.
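To make the hybrid attention description concrete, here is a minimal sketch of the two masks such a stack interleaves, plus a count of cached key/value positions. `WINDOW` and `LOCAL_PER_GLOBAL` are hypothetical placeholders; this card does not state Gemma 4's actual window size or local-to-global layer ratio.

```python
# Hypothetical parameters for illustration only (not Gemma 4's real values).
WINDOW = 4            # local sliding-window size, in tokens
LOCAL_PER_GLOBAL = 5  # local layers placed before each global layer


def global_mask(n):
    """Full causal attention: query i may attend to every key j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def local_mask(n, window):
    """Sliding-window causal attention: query i sees only the last `window` keys, itself included."""
    return [[i - window < j <= i for j in range(n)] for i in range(n)]


def layer_pattern(num_layers, local_per_global):
    """Repeat `local_per_global` local layers followed by one global layer."""
    return ["global" if (k + 1) % (local_per_global + 1) == 0 else "local"
            for k in range(num_layers)]


def kv_cache_entries(seq_len, num_layers, window, local_per_global):
    """Cached key/value positions summed over layers; a local layer keeps at most `window`."""
    return sum(seq_len if kind == "global" else min(window, seq_len)
               for kind in layer_pattern(num_layers, local_per_global))


for row in local_mask(8, WINDOW):
    print("".join("x" if allowed else "." for allowed in row))
print(layer_pattern(12, LOCAL_PER_GLOBAL))
print(kv_cache_entries(100, 12, WINDOW, LOCAL_PER_GLOBAL))  # 10 local * 4 + 2 global * 100 = 240
```

The design point the sketch illustrates: only the global layers need cache memory that grows with the full context, which is one reason a long (256K-token) window stays affordable.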
Google DeepMind's Gemma 4 Quantization-Aware Training (QAT) releases simulate lower-precision weights during training itself, so the model learns parameters that lose little accuracy when they are later quantized. This substantially reduces VRAM requirements and speeds up local inference on consumer hardware and mobile devices, while keeping quality close to that of the uncompressed baseline.
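A simplified sketch of what "simulating lower precision" means: during a QAT forward pass, weights go through a quantize-then-dequantize round trip ("fake quantization"). This is not Google's training recipe. It assumes the Q4_0 block layout used by llama.cpp (32 weights per block, one scale, 4-bit codes) and omits the gradient rule (for example a straight-through estimator) that real QAT needs.

```python
import math

QK4_0 = 32  # weights per block in llama.cpp's Q4_0 format


def fake_quant_q4_0_block(block):
    """Quantize one block to 4-bit Q4_0-style codes, then dequantize back to floats."""
    # Take the value with the largest magnitude, keeping its sign.
    max_val = 0.0
    for v in block:
        if abs(v) > abs(max_val):
            max_val = v
    # Scale chosen so max_val maps to code 0, i.e. (0 - 8) * d == max_val.
    # Kept as a Python float here; the GGUF format stores it as float16.
    d = max_val / -8.0
    inv_d = 1.0 / d if d else 0.0
    out = []
    for v in block:
        q = min(15, int(v * inv_d + 8.5))  # 4-bit code in [0, 15]
        out.append((q - 8) * d)            # dequantized value the forward pass sees
    return out


def fake_quant_q4_0(weights):
    """Apply the round trip block by block (length must be a multiple of 32)."""
    if len(weights) % QK4_0:
        raise ValueError("Q4_0 works on blocks of 32 weights")
    out = []
    for start in range(0, len(weights), QK4_0):
        out.extend(fake_quant_q4_0_block(weights[start:start + QK4_0]))
    return out


# Usage: a QAT forward pass would use `approx` in place of `weights`.
weights = [math.sin(0.37 * i) for i in range(64)]
approx = fake_quant_q4_0(weights)
print(max(abs(w - a) for w, a in zip(weights, approx)))
```

Because training sees the rounding error on every step, the learned weights end up in regions where this error hurts accuracy less.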
Model Files
| File Name | Quant Type | File Size |
|---|---|---|
| gemma-4-12B-it-qat-q4_0-unquantized.BF16.gguf | BF16 | 23.8 GB |
| gemma-4-12B-it-qat-q4_0-unquantized.F16.gguf | F16 | 23.8 GB |
| gemma-4-12B-it-qat-q4_0-unquantized.Q2_K.gguf | Q2_K | 4.83 GB |
| gemma-4-12B-it-qat-q4_0-unquantized.Q3_K_L.gguf | Q3_K_L | 6.57 GB |
| gemma-4-12B-it-qat-q4_0-unquantized.Q3_K_M.gguf | Q3_K_M | 6.09 GB |
| gemma-4-12B-it-qat-q4_0-unquantized.Q3_K_S.gguf | Q3_K_S | 5.53 GB |
| gemma-4-12B-it-qat-q4_0-unquantized.Q4_0.gguf | Q4_0 | 6.98 GB |
| gemma-4-12B-it-qat-q4_0-unquantized.Q4_K_M.gguf | Q4_K_M | 7.38 GB |
| gemma-4-12B-it-qat-q4_0-unquantized.Q4_K_S.gguf | Q4_K_S | 7.02 GB |
| gemma-4-12B-it-qat-q4_0-unquantized.Q5_0.gguf | Q5_0 | 8.34 GB |
| gemma-4-12B-it-qat-q4_0-unquantized.Q5_K_M.gguf | Q5_K_M | 8.55 GB |
| gemma-4-12B-it-qat-q4_0-unquantized.Q5_K_S.gguf | Q5_K_S | 8.34 GB |
| gemma-4-12B-it-qat-q4_0-unquantized.Q6_K.gguf | Q6_K | 9.79 GB |
| gemma-4-12B-it-qat-q4_0-unquantized.Q8_0.gguf | Q8_0 | 12.7 GB |
| gemma-4-12B-it-qat-q4_0-unquantized.mmproj-bf16.gguf | mmproj-bf16 | 175 MB |
| gemma-4-12B-it-qat-q4_0-unquantized.mmproj-f16.gguf | mmproj-f16 | 175 MB |
| gemma-4-12B-it-qat-q4_0-unquantized.mmproj-q8_0.gguf | mmproj-q8_0 | 159 MB |
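As a rough aid for choosing a file, this sketch picks the largest quantization whose weights fit a memory budget, optionally counting a projector file for image input. The sizes are copied from the table above (MB converted to GB as 1 GB = 1000 MB). The `overhead_gb` reserve for the KV cache and runtime buffers is a hypothetical placeholder; real overhead depends on context length and backend.

```python
# File sizes in GB, copied from the Model Files table.
QUANT_SIZES_GB = {
    "Q2_K": 4.83, "Q3_K_S": 5.53, "Q3_K_M": 6.09, "Q3_K_L": 6.57,
    "Q4_0": 6.98, "Q4_K_S": 7.02, "Q4_K_M": 7.38, "Q5_0": 8.34,
    "Q5_K_S": 8.34, "Q5_K_M": 8.55, "Q6_K": 9.79, "Q8_0": 12.7,
    "F16": 23.8, "BF16": 23.8,
}
MMPROJ_SIZES_GB = {"mmproj-q8_0": 0.159, "mmproj-f16": 0.175, "mmproj-bf16": 0.175}


def largest_fitting_quant(budget_gb, overhead_gb=1.5, mmproj=None):
    """Return the biggest quant that fits the budget, or None if nothing fits.

    overhead_gb is a hypothetical reserve for the KV cache and runtime buffers.
    """
    extra = MMPROJ_SIZES_GB[mmproj] if mmproj else 0.0
    available = budget_gb - overhead_gb - extra
    fitting = [(size, name) for name, size in QUANT_SIZES_GB.items() if size <= available]
    return max(fitting)[1] if fitting else None


print(largest_fitting_quant(12, mmproj="mmproj-q8_0"))  # Q6_K
```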
llama.cpp
LLM inference in C/C++ — https://github.com/ggml-org/llama.cpp
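A hedged sketch of launching an image prompt with llama.cpp. It assumes a llama.cpp build that ships the `llama-mtmd-cli` multimodal tool and supports this model's architecture; `photo.jpg`, the prompt, and the context size are placeholders. The function only assembles the command line and executes nothing.

```python
import shlex


def build_mtmd_command(model_path, mmproj_path, image_path, prompt, ctx_size=8192):
    """Return argv for llama.cpp's multimodal CLI; nothing is executed here."""
    return [
        "llama-mtmd-cli",
        "-m", model_path,          # language-model GGUF (any quant from the table)
        "--mmproj", mmproj_path,   # projector GGUF, needed for image input
        "--image", image_path,     # placeholder local image
        "-c", str(ctx_size),       # context size; 8192 is an arbitrary example value
        "-p", prompt,
    ]


cmd = build_mtmd_command(
    "gemma-4-12B-it-qat-q4_0-unquantized.Q4_K_M.gguf",
    "gemma-4-12B-it-qat-q4_0-unquantized.mmproj-f16.gguf",
    "photo.jpg",
    "Describe this image.",
)
print(shlex.join(cmd))  # pass `cmd` to subprocess.run(...) to actually launch it
```

The language-model file and the projector file are passed separately, which is why the table lists the mmproj files alongside the quantized weights.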
Model tree for prithivMLmods/gemma-4-12B-it-qat-GGUF
- Base model: google/gemma-4-12B