gemma-4-26B-A4B-it-qat-GGUF

google/gemma-4-26B-A4B-it-qat-q4_0-unquantized is an instruction-tuned, multimodal Mixture-of-Experts (MoE) model from Google DeepMind and part of the Gemma 4 family. It has 25.2 billion total parameters, of which only 3.8 billion are active during inference, and it is optimized via Quantization-Aware Training (QAT) to preserve near-bfloat16 quality at significantly reduced memory requirements.

Its sparse MoE architecture has 128 experts plus 1 shared expert and routes each token through only a small subset of them (roughly 4B active parameters), across 30 layers with a 1024-token sliding window. This lets it run nearly as fast as a dedicated 4B model while delivering quality competitive with much larger dense models. Reported scores are 82.6% on MMLU Pro, 82.3% on GPQA Diamond, 88.3% on AIME 2026, 77.1% on LiveCodeBench v6, 73.8% on MMMU Pro (vision), and 44.1% on the 256K long-context MRCR v2 task.

The model supports text and image input (no audio) with a 256K-token context window, a ~550M-parameter vision encoder, and a 262K vocabulary covering 140+ languages. It enables image understanding, OCR, video frame analysis, native function calling, and a configurable thinking/reasoning mode. The Q4_0 unquantized variant provides half-precision weights extracted from the QAT pipeline, which makes it well suited to custom downstream compilation and to research targeting high-throughput, cost-efficient server-side deployment.
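To make the sparse routing idea concrete, here is a minimal sketch of generic top-k expert routing in plain Python. It is not Gemma 4's actual router: the value `TOP_K = 8`, the random router logits, and the function name `top_k_route` are assumptions chosen for illustration. Only the expert count (128) and the parameter figures (3.8B active, 25.2B total) come from the description above.

```python
import math
import random


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    # Select the k experts with the largest router logits.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the mixing weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


NUM_EXPERTS = 128  # routed experts, from the model description
TOP_K = 8          # ASSUMPTION for illustration; not stated in the description

# Hypothetical router scores for a single token.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

routes = top_k_route(logits, TOP_K)
# The shared expert would run for every token and is not part of this selection.
active_fraction = 3.8 / 25.2  # active / total parameters, from the description

print("experts chosen for this token:", sorted(i for i, _ in routes))
print(f"share of parameters active per token: {active_fraction:.1%}")
```

Only the selected experts' feed-forward weights take part in the computation for a given token, which is why per-token compute tracks the active parameter count rather than the total.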

Google DeepMind’s Gemma 4 Quantization-Aware Training (QAT) releases compress models by simulating lower precision during the training process itself. This drastically reduces VRAM requirements and accelerates local inference on consumer hardware and mobile devices while preserving the near-original quality of uncompressed baselines.
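As a rough illustration of "simulating lower precision", the sketch below applies fake quantization to a weight vector: each block of weights is rounded to a 4-bit integer grid and immediately converted back to floats. In a typical QAT setup the forward pass would see these rounded values while the optimizer keeps updating the full-precision weights. This is a simplified symmetric scheme under assumed settings (block size 32, one scale per block); it is neither the exact GGUF Q4_0 layout nor Google's training recipe, which the text above does not describe.

```python
import random


def fake_quantize_block(block, bits=4):
    """Quantize one block to signed integers, then dequantize back to floats."""
    qmax = 2 ** (bits - 1) - 1   # 7 for 4-bit
    qmin = -(2 ** (bits - 1))    # -8 for 4-bit
    amax = max(abs(w) for w in block)
    if amax == 0.0:
        return list(block)       # an all-zero block needs no scale
    scale = amax / qmax          # one scale shared by the whole block
    out = []
    for w in block:
        q = max(qmin, min(qmax, round(w / scale)))  # integer code
        out.append(q * scale)                       # value the forward pass sees
    return out


def fake_quantize(weights, block_size=32, bits=4):
    """Apply block-wise fake quantization to a flat list of weights."""
    out = []
    for start in range(0, len(weights), block_size):
        out.extend(fake_quantize_block(weights[start:start + block_size], bits))
    return out


# Hypothetical weights, for demonstration only.
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(64)]
quantized = fake_quantize(weights)

max_error = max(abs(a - b) for a, b in zip(weights, quantized))
print(f"largest rounding error: {max_error:.6f}")
```

Because the network is exposed to this rounding error during training, it can adapt its weights to it, which is the intuition behind QAT checkpoints retaining more quality after 4-bit export than weights quantized only after training.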

Model Files

File Name	Quant Type	File Size	File Link
gemma-4-26B-A4B-it-qat.BF16.gguf	BF16	50.5 GB	Download
gemma-4-26B-A4B-it-qat.F16.gguf	F16	50.5 GB	Download
gemma-4-26B-A4B-it-qat.F32.gguf	F32	101 GB	Download
gemma-4-26B-A4B-it-qat.Q2_K.gguf	Q2_K	10.6 GB	Download
gemma-4-26B-A4B-it-qat.Q3_K_L.gguf	Q3_K_L	13.8 GB	Download
gemma-4-26B-A4B-it-qat.Q3_K_M.gguf	Q3_K_M	13.3 GB	Download
gemma-4-26B-A4B-it-qat.Q3_K_S.gguf	Q3_K_S	12.2 GB	Download
gemma-4-26B-A4B-it-qat.Q4_0.gguf	Q4_0	14.4 GB	Download
gemma-4-26B-A4B-it-qat.Q4_K_M.gguf	Q4_K_M	16.8 GB	Download
gemma-4-26B-A4B-it-qat.Q4_K_S.gguf	Q4_K_S	15.5 GB	Download
gemma-4-26B-A4B-it-qat.Q5_0.gguf	Q5_0	17.5 GB	Download
gemma-4-26B-A4B-it-qat.Q5_K_M.gguf	Q5_K_M	19.1 GB	Download
gemma-4-26B-A4B-it-qat.Q5_K_S.gguf	Q5_K_S	18 GB	Download
gemma-4-26B-A4B-it-qat.Q6_K.gguf	Q6_K	22.6 GB	Download
gemma-4-26B-A4B-it-qat.Q8_0.gguf	Q8_0	26.9 GB	Download
gemma-4-26B-A4B-it-qat.mmproj-bf16.gguf	mmproj-bf16	1.19 GB	Download
gemma-4-26B-A4B-it-qat.mmproj-f16.gguf	mmproj-f16	1.19 GB	Download
gemma-4-26B-A4B-it-qat.mmproj-f32.gguf	mmproj-f32	2.29 GB	Download
gemma-4-26B-A4B-it-qat.mmproj-q8_0.gguf	mmproj-q8_0	806 MB	Download
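
The file sizes in the table can be turned into an approximate bits-per-weight figure, which helps when choosing a quantization for a given memory budget. The sketch below assumes decimal gigabytes (1 GB = 1e9 bytes) and divides by the 25.2B total parameter count from the description. The 2 GB `overhead_gb` allowance for context cache and runtime buffers is a placeholder assumption, not a measured value, and the helper names are hypothetical.

```python
TOTAL_PARAMS = 25.2e9  # total parameters, from the model description

# File sizes in GB, copied from the table above.
FILE_SIZES_GB = {
    "BF16": 50.5, "F16": 50.5, "F32": 101.0,
    "Q2_K": 10.6, "Q3_K_L": 13.8, "Q3_K_M": 13.3, "Q3_K_S": 12.2,
    "Q4_0": 14.4, "Q4_K_M": 16.8, "Q4_K_S": 15.5,
    "Q5_0": 17.5, "Q5_K_M": 19.1, "Q5_K_S": 18.0,
    "Q6_K": 22.6, "Q8_0": 26.9,
}


def bits_per_weight(size_gb, params=TOTAL_PARAMS):
    """Approximate average bits stored per parameter (assumes 1 GB = 1e9 bytes)."""
    return size_gb * 1e9 * 8 / params


def quants_that_fit(budget_gb, overhead_gb=2.0):
    """List quant types whose file plus an assumed overhead fits in the budget."""
    fitting = [(size, name) for name, size in FILE_SIZES_GB.items()
               if size + overhead_gb <= budget_gb]
    return [name for size, name in sorted(fitting)]


for name in ("Q4_0", "Q8_0", "BF16"):
    print(f"{name}: ~{bits_per_weight(FILE_SIZES_GB[name]):.2f} bits per weight")

print("fits in 16 GB:", quants_that_fit(16.0))
```

These figures are estimates: a GGUF file also carries metadata and may keep some tensors at higher precision, and the mmproj vision files listed at the end of the table are loaded in addition to the main model file.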

llama.cpp

LLM inference in C/C++ — https://github.com/ggml-org/llama.cpp
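
A minimal sketch of how one of the files above might be fetched and served with llama.cpp follows. It only builds the download URL and the command line; it does not download or run anything. The URL follows the usual Hugging Face `resolve/main` pattern, `-m`, `--mmproj`, `-c`, and `-ngl` are existing `llama-server` options, and the context size, GPU layer count, and local paths are illustrative assumptions. Whether a given llama.cpp build supports the `gemma4` architecture is also an assumption to verify against your installed version.

```python
import shlex

REPO_ID = "prithivMLmods/gemma-4-26B-A4B-it-qat-GGUF"


def resolve_url(filename, repo_id=REPO_ID, revision="main"):
    """Build the direct download URL for a file in a Hugging Face repo."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


def llama_server_command(model_path, mmproj_path, ctx_size=8192, gpu_layers=99):
    """Build an argv list for llama-server with a vision projector attached."""
    return [
        "llama-server",
        "-m", model_path,          # main model weights
        "--mmproj", mmproj_path,   # multimodal projector for image input
        "-c", str(ctx_size),       # context length (illustrative value)
        "-ngl", str(gpu_layers),   # layers to offload to GPU (illustrative value)
    ]


model_file = "gemma-4-26B-A4B-it-qat.Q4_0.gguf"
mmproj_file = "gemma-4-26B-A4B-it-qat.mmproj-f16.gguf"

print(resolve_url(model_file))
print(resolve_url(mmproj_file))
print(shlex.join(llama_server_command(model_file, mmproj_file)))
```

The mmproj file is only needed for image input; for text-only use, the `--mmproj` argument can be dropped from the command.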

Format: GGUF
Model size: 25B params
Architecture: gemma4
Available quantizations: 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, 8-bit, 16-bit, 32-bit

Model tree for prithivMLmods/gemma-4-26B-A4B-it-qat-GGUF

Base model: google/gemma-4-26B-A4B
Finetuned: google/gemma-4-26B-A4B-it
Finetuned: google/gemma-4-26B-A4B-it-qat-q4_0-unquantized
Quantized (22): this model

Collection including prithivMLmods/gemma-4-26B-A4B-it-qat-GGUF

Quantization-aware training (QAT) GGUFs for the Gemma 4 family of models (6 items)

URL: https://huggingface.co/prithivMLmods/gemma-4-26B-A4B-it-qat-GGUF
