gemma-4-26B-A4B-it-qat-GGUF
google/gemma-4-26B-A4B-it-qat-q4_0-unquantized is an instruction-tuned, multimodal Mixture-of-Experts (MoE) model from Google DeepMind, part of the Gemma 4 family. It has 25.2 billion total parameters, of which only 3.8 billion are active per token during inference, and it is trained with Quantization-Aware Training (QAT) to preserve near-bfloat16 quality at significantly reduced memory requirements.

The sparse MoE architecture routes each token to only a small subset of its 128 experts (plus 1 shared expert) across 30 layers with a 1024-token sliding window. This lets it run nearly as fast as a dedicated 4B model while delivering quality competitive with much larger dense models. Reported benchmark scores are 82.6% on MMLU Pro, 82.3% on GPQA Diamond, 88.3% on AIME 2026, 77.1% on LiveCodeBench v6, 73.8% on MMMU Pro (vision), and 44.1% on the 256K long-context MRCR v2 task.

The model supports text and image inputs (no audio) with a 256K-token context window, a ~550M-parameter vision encoder, and a 262K vocabulary covering 140+ languages. It enables image understanding, OCR, video frame analysis, native function calling, and a configurable thinking/reasoning mode. The Q4_0 unquantized variant provides half-precision weights extracted from the QAT pipeline, which makes it suited to custom downstream compilation and to research targeting high-throughput, cost-efficient server-side deployment.
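To make the memory claim concrete, the following back-of-the-envelope sketch estimates file sizes from the parameter count quoted above. The bits-per-weight values are nominal figures for the formats (Q4_0 and Q8_0 store one 16-bit scale per block of 32 weights); the helper name `estimated_size_gb` is hypothetical, and the calculation assumes every tensor uses the same format, which real GGUF files generally do not.

```python
# Rough GGUF size estimate: parameters x bits-per-weight / 8.
TOTAL_PARAMS = 25.2e9   # total parameters, from the description above
ACTIVE_PARAMS = 3.8e9   # parameters active per token, from the description above

# Nominal bits per weight, including per-block scales where applicable.
BITS_PER_WEIGHT = {
    "F32": 32.0,
    "BF16": 16.0,
    "Q8_0": 8.5,   # 32 x 8-bit values + one 16-bit scale per block
    "Q4_0": 4.5,   # 32 x 4-bit values + one 16-bit scale per block
}


def estimated_size_gb(total_params: float, bits_per_weight: float) -> float:
    """Return an approximate file size in decimal gigabytes."""
    return total_params * bits_per_weight / 8 / 1e9


# Share of the parameters that take part in computing a single token.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

estimates = {
    name: estimated_size_gb(TOTAL_PARAMS, bpw)
    for name, bpw in BITS_PER_WEIGHT.items()
}

print(f"active fraction: {active_fraction:.1%}")
for name, size in estimates.items():
    print(f"{name}: ~{size:.1f} GB")
```

The estimates come out close to, and slightly below, the sizes listed under Model Files. The remaining gap is plausibly due to metadata and tensors kept at higher precision, though that is an assumption rather than something the card states.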
Google DeepMind’s Gemma 4 Quantization-Aware Training (QAT) releases simulate lower precision during training itself, so the weights stay accurate once they are quantized. This substantially reduces VRAM requirements and speeds up local inference on consumer hardware and mobile devices while keeping quality close to the uncompressed baselines.
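As an illustration of what "simulating lower precision" can mean, here is a minimal sketch of block-wise fake quantization: weights are rounded to a 4-bit grid and immediately converted back to floats, so a forward pass sees the rounding error it will later meet after real quantization. This is a simplified symmetric scheme chosen for clarity, not the exact Q4_0 layout and not Google's actual QAT recipe, which this card does not describe. The names `fake_quantize` and `BLOCK_SIZE` are hypothetical.

```python
from typing import List

BLOCK_SIZE = 32  # weights sharing one scale (assumed, mirroring 4-bit block formats)


def fake_quantize_block(block: List[float]) -> List[float]:
    """Quantize one block to signed 4-bit integers, then dequantize it."""
    max_abs = max(abs(w) for w in block)
    if max_abs == 0.0:
        return [0.0] * len(block)
    scale = max_abs / 7.0  # map the largest magnitude onto the integer 7
    out = []
    for w in block:
        q = round(w / scale)
        q = max(-8, min(7, q))  # clamp to the signed 4-bit range
        out.append(q * scale)   # back to float: this is the "simulated" weight
    return out


def fake_quantize(weights: List[float]) -> List[float]:
    """Apply fake quantization independently to each block of weights."""
    out: List[float] = []
    for start in range(0, len(weights), BLOCK_SIZE):
        out.extend(fake_quantize_block(weights[start:start + BLOCK_SIZE]))
    return out


# Deterministic toy weights in the range [-0.5, 0.5].
weights = [((i * 37) % 101 - 50) / 100.0 for i in range(64)]
simulated = fake_quantize(weights)
max_error = max(abs(w - s) for w, s in zip(weights, simulated))
print(f"largest rounding error: {max_error:.4f}")
```

In a training loop, the forward pass would use `simulated` in place of `weights`, while gradients are typically passed through the rounding step unchanged (a straight-through estimator). That part is omitted here.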
Model Files
| File Name | Quant Type | File Size | File Link |
|---|---|---|---|
| gemma-4-26B-A4B-it-qat.BF16.gguf | BF16 | 50.5 GB | Download |
| gemma-4-26B-A4B-it-qat.F16.gguf | F16 | 50.5 GB | Download |
| gemma-4-26B-A4B-it-qat.F32.gguf | F32 | 101 GB | Download |
| gemma-4-26B-A4B-it-qat.Q2_K.gguf | Q2_K | 10.6 GB | Download |
| gemma-4-26B-A4B-it-qat.Q3_K_L.gguf | Q3_K_L | 13.8 GB | Download |
| gemma-4-26B-A4B-it-qat.Q3_K_M.gguf | Q3_K_M | 13.3 GB | Download |
| gemma-4-26B-A4B-it-qat.Q3_K_S.gguf | Q3_K_S | 12.2 GB | Download |
| gemma-4-26B-A4B-it-qat.Q4_0.gguf | Q4_0 | 14.4 GB | Download |
| gemma-4-26B-A4B-it-qat.Q4_K_M.gguf | Q4_K_M | 16.8 GB | Download |
| gemma-4-26B-A4B-it-qat.Q4_K_S.gguf | Q4_K_S | 15.5 GB | Download |
| gemma-4-26B-A4B-it-qat.Q5_0.gguf | Q5_0 | 17.5 GB | Download |
| gemma-4-26B-A4B-it-qat.Q5_K_M.gguf | Q5_K_M | 19.1 GB | Download |
| gemma-4-26B-A4B-it-qat.Q5_K_S.gguf | Q5_K_S | 18 GB | Download |
| gemma-4-26B-A4B-it-qat.Q6_K.gguf | Q6_K | 22.6 GB | Download |
| gemma-4-26B-A4B-it-qat.Q8_0.gguf | Q8_0 | 26.9 GB | Download |
| gemma-4-26B-A4B-it-qat.mmproj-bf16.gguf | mmproj-bf16 | 1.19 GB | Download |
| gemma-4-26B-A4B-it-qat.mmproj-f16.gguf | mmproj-f16 | 1.19 GB | Download |
| gemma-4-26B-A4B-it-qat.mmproj-f32.gguf | mmproj-f32 | 2.29 GB | Download |
| gemma-4-26B-A4B-it-qat.mmproj-q8_0.gguf | mmproj-q8_0 | 806 MB | Download |
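One way to use the table is to choose the largest quantization that fits a memory budget and then fetch that file. The sketch below copies the sizes from the table and uses `hf_hub_download` from the `huggingface_hub` package. The `headroom_gb` parameter and its default of 2 GB are placeholder assumptions for context cache and runtime overhead, not measured values, and the helper names are hypothetical.

```python
from typing import Optional

REPO_ID = "prithivMLmods/gemma-4-26B-A4B-it-qat-GGUF"

# File sizes in GB, copied from the table above (quantized variants only).
QUANT_SIZES_GB = {
    "Q2_K": 10.6, "Q3_K_S": 12.2, "Q3_K_M": 13.3, "Q3_K_L": 13.8,
    "Q4_0": 14.4, "Q4_K_S": 15.5, "Q4_K_M": 16.8, "Q5_0": 17.5,
    "Q5_K_S": 18.0, "Q5_K_M": 19.1, "Q6_K": 22.6, "Q8_0": 26.9,
}


def pick_quant(budget_gb: float, headroom_gb: float = 2.0) -> Optional[str]:
    """Return the largest quant whose file fits in the budget minus headroom."""
    usable = budget_gb - headroom_gb
    fitting = [(size, name) for name, size in QUANT_SIZES_GB.items() if size <= usable]
    return max(fitting)[1] if fitting else None


def gguf_filename(quant: str) -> str:
    """Build the file name using the naming pattern shown in the table."""
    return f"gemma-4-26B-A4B-it-qat.{quant}.gguf"


def download(quant: str) -> str:
    """Download one GGUF file and return its local path (requires network access)."""
    from huggingface_hub import hf_hub_download  # imported lazily; pip install huggingface_hub
    return hf_hub_download(repo_id=REPO_ID, filename=gguf_filename(quant))


if __name__ == "__main__":
    choice = pick_quant(budget_gb=24.0)
    print(choice, gguf_filename(choice) if choice else "nothing fits")
```

Image input additionally needs one of the `mmproj` files from the same table, which can be fetched the same way by passing its file name to `hf_hub_download` directly.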
llama.cpp
LLM inference in C/C++ — https://github.com/ggml-org/llama.cpp
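A GGUF file and its matching `mmproj` file can be served with llama.cpp's `llama-server` binary. The sketch below only assembles the command line. It assumes `llama-server` is on your `PATH` and that your llama.cpp build supports this architecture, neither of which this card confirms. The local file paths, context size, and port are illustrative values.

```python
import shlex
from typing import List


def build_server_command(
    model_path: str,
    mmproj_path: str,
    ctx_size: int = 8192,
    port: int = 8080,
) -> List[str]:
    """Assemble a llama-server invocation for a model plus vision projector."""
    return [
        "llama-server",
        "-m", model_path,         # main model weights (GGUF)
        "--mmproj", mmproj_path,  # multimodal projector for image input
        "-c", str(ctx_size),      # context length in tokens
        "--port", str(port),      # HTTP port to listen on
    ]


cmd = build_server_command(
    "gemma-4-26B-A4B-it-qat.Q4_0.gguf",
    "gemma-4-26B-A4B-it-qat.mmproj-f16.gguf",
)
print(shlex.join(cmd))

# To actually launch the server, pass the list to subprocess:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Building the argument list separately from running it makes the command easy to inspect or log first. A larger `ctx_size` increases memory use beyond the file sizes listed above.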
Model tree for prithivMLmods/gemma-4-26B-A4B-it-qat-GGUF
Base model
google/gemma-4-26B-A4B