gemma-4-E2B-it-qat-GGUF
gemma-4-E2B-it-qat-q4_0-unquantized is a compact, instruction-tuned multimodal model from Google DeepMind, part of the Gemma 4 family and optimized for on-device use. It has 2.3 billion effective parameters (5.1B total including embeddings) and is trained with Quantization-Aware Training (QAT) to maintain near-bfloat16 quality at a significantly reduced memory footprint.

Designed for efficient local execution on laptops and mobile devices, it employs Per-Layer Embeddings (PLE) for parameter efficiency and supports text, image, and audio modalities with a 128K token context window and a 262K vocabulary across 140+ languages.

The model uses a hybrid attention mechanism that interleaves local sliding window attention (512 tokens) with full global attention across 35 layers, alongside a ~150M parameter vision encoder and a ~300M parameter audio encoder. Its capabilities include ASR, speech translation, image understanding, OCR, video frame analysis, native function calling, and a configurable thinking/reasoning mode.

The Q4_0 unquantized variant provides half-precision weights extracted from the QAT pipeline, which makes it suited to custom downstream compilation and research. Reported benchmark scores are 60.0% on MMLU Pro, 43.4% on GPQA Diamond, 33.47 on CoVoST audio translation, and 19.1% on the 128K long-context MRCR v2 task.
Google DeepMind's Gemma 4 Quantization-Aware Training (QAT) releases compress models by simulating lower precision during training itself. This reduces VRAM requirements and accelerates local inference on consumer hardware and mobile devices while keeping quality close to that of the uncompressed baselines.
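To make "simulating lower precision during training" more concrete, the sketch below shows the quantize-then-dequantize round trip ("fake quantization") that QAT schemes typically apply to weights in the forward pass. It is an illustration only: the block size of 32, the symmetric signed 4-bit range, and the function names are assumptions for this example, not the actual Gemma 4 training recipe or the exact llama.cpp Q4_0 layout.

```python
import random

BLOCK_SIZE = 32      # assumed block size for this sketch
QMIN, QMAX = -8, 7   # signed 4-bit integer range


def fake_quantize_block(block):
    """Round a block to 4-bit integers with one shared scale, then dequantize."""
    max_abs = max(abs(w) for w in block)
    if max_abs == 0.0:
        return list(block)  # nothing to quantize in an all-zero block
    scale = max_abs / QMAX
    out = []
    for w in block:
        q = round(w / scale)           # nearest integer level
        q = max(QMIN, min(QMAX, q))    # clamp to the 4-bit range
        out.append(q * scale)          # back to floating point
    return out


def fake_quantize(weights, block_size=BLOCK_SIZE):
    """Apply the round trip block by block over a flat list of weights."""
    out = []
    for start in range(0, len(weights), block_size):
        out.extend(fake_quantize_block(weights[start:start + block_size]))
    return out


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(128)]  # toy weights
simulated = fake_quantize(weights)
max_error = max(abs(a - b) for a, b in zip(weights, simulated))
print(f"max round-trip error: {max_error:.6f}")
```

During QAT the model would compute its loss with values like `simulated` while still updating the full-precision `weights`, so the network learns parameters that tolerate the rounding applied at export time.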
Model Files
| File Name | Quant Type | File Size | File Link |
|---|---|---|---|
| gemma-4-E2B-it-qat.BF16.gguf | BF16 | 9.27 GB | Download |
| gemma-4-E2B-it-qat.F16.gguf | F16 | 9.27 GB | Download |
| gemma-4-E2B-it-qat.F32.gguf | F32 | 18.5 GB | Download |
| gemma-4-E2B-it-qat.Q2_K.gguf | Q2_K | 2.98 GB | Download |
| gemma-4-E2B-it-qat.Q3_K_L.gguf | Q3_K_L | 3.27 GB | Download |
| gemma-4-E2B-it-qat.Q3_K_M.gguf | Q3_K_M | 3.19 GB | Download |
| gemma-4-E2B-it-qat.Q3_K_S.gguf | Q3_K_S | 3.1 GB | Download |
| gemma-4-E2B-it-qat.Q4_0.gguf | Q4_0 | 3.35 GB | Download |
| gemma-4-E2B-it-qat.Q4_K_M.gguf | Q4_K_M | 3.42 GB | Download |
| gemma-4-E2B-it-qat.Q4_K_S.gguf | Q4_K_S | 3.35 GB | Download |
| gemma-4-E2B-it-qat.Q5_0.gguf | Q5_0 | 3.58 GB | Download |
| gemma-4-E2B-it-qat.Q5_K_M.gguf | Q5_K_M | 3.62 GB | Download |
| gemma-4-E2B-it-qat.Q5_K_S.gguf | Q5_K_S | 3.58 GB | Download |
| gemma-4-E2B-it-qat.Q6_K.gguf | Q6_K | 3.83 GB | Download |
| gemma-4-E2B-it-qat.Q8_0.gguf | Q8_0 | 4.93 GB | Download |
| gemma-4-E2B-it-qat.mmproj-bf16.gguf | mmproj-bf16 | 987 MB | Download |
| gemma-4-E2B-it-qat.mmproj-f16.gguf | mmproj-f16 | 987 MB | Download |
| gemma-4-E2B-it-qat.mmproj-f32.gguf | mmproj-f32 | 1.9 GB | Download |
| gemma-4-E2B-it-qat.mmproj-q8_0.gguf | mmproj-q8_0 | 557 MB | Download |
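As a rough aid for choosing among the files above, the following sketch picks the largest main-model file that fits a given size budget. The sizes are copied from the table; the `largest_fit` helper and the budgets are hypothetical. File size is only a lower bound on memory use, since the runtime also needs room for the KV cache and, for image or audio input, one of the mmproj files.

```python
# Main-model file sizes in GB, copied from the table above (mmproj files excluded).
FILE_SIZES_GB = {
    "BF16": 9.27, "F16": 9.27, "F32": 18.5,
    "Q2_K": 2.98,
    "Q3_K_L": 3.27, "Q3_K_M": 3.19, "Q3_K_S": 3.10,
    "Q4_0": 3.35, "Q4_K_M": 3.42, "Q4_K_S": 3.35,
    "Q5_0": 3.58, "Q5_K_M": 3.62, "Q5_K_S": 3.58,
    "Q6_K": 3.83, "Q8_0": 4.93,
}


def largest_fit(budget_gb, sizes=FILE_SIZES_GB):
    """Return the quant type with the largest file not exceeding budget_gb, or None."""
    candidates = {name: size for name, size in sizes.items() if size <= budget_gb}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


for budget in (3.0, 3.5, 5.0):  # example budgets in GB
    print(budget, largest_fit(budget))
```

Larger files generally retain more precision, which is why the helper prefers the biggest file that fits; it does not account for quality differences between quant types of similar size.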
llama.cpp
LLM inference in C/C++ — https://github.com/ggml-org/llama.cpp
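For image input, llama.cpp pairs a main model file with one of the mmproj files listed above. The sketch below only assembles a command line for the `llama-mtmd-cli` tool that ships with recent llama.cpp builds; it does not run anything. The choice of quant files, the image path `photo.jpg`, and the prompt are placeholders, and whether a given llama.cpp build supports this model is an assumption to verify locally.

```python
import shlex


def build_mtmd_command(model_path, mmproj_path, image_path, prompt,
                       binary="llama-mtmd-cli"):
    """Assemble an argv list for llama.cpp's multimodal CLI (nothing is executed)."""
    return [
        binary,
        "-m", model_path,         # main GGUF model file
        "--mmproj", mmproj_path,  # multimodal projector file
        "--image", image_path,    # input image
        "-p", prompt,             # text prompt
    ]


command = build_mtmd_command(
    "gemma-4-E2B-it-qat.Q4_0.gguf",
    "gemma-4-E2B-it-qat.mmproj-f16.gguf",
    "photo.jpg",                  # hypothetical image path
    "Describe this image.",
)
print(shlex.join(command))
```

Keeping the command as a list means it can be passed directly to `subprocess.run(command)` without shell quoting issues once the binary and files are in place.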
Model tree for prithivMLmods/gemma-4-E2B-it-qat-GGUF
Base model
google/gemma-4-E2B