gemma-4-E4B-it-qat-GGUF
gemma-4-E4B-it-qat-q4_0-unquantized is a mid-tier, instruction-tuned multimodal model from Google DeepMind, optimized for on-device use and part of the Gemma 4 family. It has 4.5 billion effective parameters (8B total including embeddings) and is optimized via Quantization-Aware Training (QAT) to retain near-bfloat16 quality while substantially lowering memory requirements. The Q4_0 unquantized variant provides half-precision weights, which suits custom downstream compilation and research.

Designed for efficient local execution on laptops and capable mobile devices, the model uses Per-Layer Embeddings (PLE) for parameter efficiency and supports text, image, and audio modalities, with a 128K-token context window and a 262K-token vocabulary covering 140+ languages. It uses a hybrid attention mechanism with 512-token sliding window attention across 42 layers, alongside a ~150M-parameter vision encoder and a ~300M-parameter audio encoder.

Capabilities include ASR, speech translation (CoVoST score of 35.54), image understanding, OCR, video frame analysis, native function calling, and a configurable thinking/reasoning mode. The E4B improves noticeably on the E2B in key benchmarks: 69.4% on MMLU Pro, 58.6% on GPQA Diamond, 42.5% on AIME 2026, 52.0% on LiveCodeBench v6, 52.6% on MMMU Pro (vision), and 25.4% on the 128K long-context MRCR v2 task. This places it between the lightweight E2B and the larger server-class models.
Google DeepMind’s Gemma 4 Quantization-Aware Training (QAT) releases compress models by simulating lower precision during the training process itself, so the weights adapt to quantization error before they are actually quantized. This substantially reduces VRAM requirements and accelerates local inference on consumer hardware and mobile devices, while keeping quality close to that of the uncompressed baselines.
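To make "simulating lower precision" concrete, here is a minimal sketch of fake quantization, the generic building block commonly used in QAT: weights are rounded onto a low-bit integer grid and immediately mapped back to floats, so the forward pass sees the rounding error. This is an illustration only, not Google's actual training recipe; the function name `fake_quantize` and the per-tensor symmetric 4-bit scheme are assumptions made for the example.

```python
def fake_quantize(weights, bits=4):
    """Round weights onto a symmetric signed integer grid, then map back to floats.

    Illustrative only: a per-tensor symmetric scheme, not the Gemma QAT recipe.
    Returns (dequantized_weights, scale).
    """
    qmax = 2 ** (bits - 1) - 1      # largest integer level, 7 for 4-bit
    qmin = -(2 ** (bits - 1))       # smallest integer level, -8 for 4-bit
    max_abs = max((abs(w) for w in weights), default=0.0)
    if max_abs == 0.0:
        # Nothing to scale; all-zero (or empty) input passes through unchanged.
        return list(weights), 1.0

    scale = max_abs / qmax          # float value of one integer step
    dequantized = []
    for w in weights:
        q = round(w / scale)                # nearest integer level
        q = max(qmin, min(qmax, q))         # clamp to the representable range
        dequantized.append(q * scale)       # back to float, now carrying rounding error
    return dequantized, scale


weights = [0.9, -0.31, 0.05, -0.7]
approx, scale = fake_quantize(weights)
for original, rounded in zip(weights, approx):
    print(f"{original:+.3f} -> {rounded:+.3f}")
```

During training, a QAT setup would typically use the rounded values in the forward pass while still updating the full-precision weights, which is why the final quantized model loses less quality than one quantized only after training.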
Model Files
| File Name | Quant Type | File Size | File Link |
|---|---|---|---|
| gemma-4-E4B-it-qat.BF16.gguf | BF16 | 14.9 GB | Download |
| gemma-4-E4B-it-qat.F16.gguf | F16 | 14.9 GB | Download |
| gemma-4-E4B-it-qat.F32.gguf | F32 | 29.9 GB | Download |
| gemma-4-E4B-it-qat.Q2_K.gguf | Q2_K | 4.38 GB | Download |
| gemma-4-E4B-it-qat.Q3_K_L.gguf | Q3_K_L | 4.99 GB | Download |
| gemma-4-E4B-it-qat.Q3_K_M.gguf | Q3_K_M | 4.82 GB | Download |
| gemma-4-E4B-it-qat.Q3_K_S.gguf | Q3_K_S | 4.63 GB | Download |
| gemma-4-E4B-it-qat.Q4_0.gguf | Q4_0 | 5.15 GB | Download |
| gemma-4-E4B-it-qat.Q4_K_M.gguf | Q4_K_M | 5.3 GB | Download |
| gemma-4-E4B-it-qat.Q4_K_S.gguf | Q4_K_S | 5.17 GB | Download |
| gemma-4-E4B-it-qat.Q5_0.gguf | Q5_0 | 5.65 GB | Download |
| gemma-4-E4B-it-qat.Q5_K_M.gguf | Q5_K_M | 5.72 GB | Download |
| gemma-4-E4B-it-qat.Q5_K_S.gguf | Q5_K_S | 5.65 GB | Download |
| gemma-4-E4B-it-qat.Q6_K.gguf | Q6_K | 6.17 GB | Download |
| gemma-4-E4B-it-qat.Q8_0.gguf | Q8_0 | 7.95 GB | Download |
| gemma-4-E4B-it-qat.mmproj-bf16.gguf | mmproj-bf16 | 992 MB | Download |
| gemma-4-E4B-it-qat.mmproj-f16.gguf | mmproj-f16 | 992 MB | Download |
| gemma-4-E4B-it-qat.mmproj-f32.gguf | mmproj-f32 | 1.91 GB | Download |
| gemma-4-E4B-it-qat.mmproj-q8_0.gguf | mmproj-q8_0 | 560 MB | Download |
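One way to use the table above is to pick the largest file that fits a given memory budget. The sketch below hard-codes the listed file sizes and leaves a fixed amount of headroom for runtime buffers. The helper name `pick_quant` and the 1 GB default headroom are hypothetical; actual memory use also depends on context length and on whether an mmproj file is loaded.

```python
# File sizes in GB, copied from the table above (mmproj files excluded).
FILES_GB = {
    "F32": 29.9, "BF16": 14.9, "F16": 14.9, "Q8_0": 7.95, "Q6_K": 6.17,
    "Q5_K_M": 5.72, "Q5_K_S": 5.65, "Q5_0": 5.65, "Q4_K_M": 5.3,
    "Q4_K_S": 5.17, "Q4_0": 5.15, "Q3_K_L": 4.99, "Q3_K_M": 4.82,
    "Q3_K_S": 4.63, "Q2_K": 4.38,
}


def pick_quant(budget_gb, headroom_gb=1.0):
    """Return the quant type of the largest file that fits, or None.

    headroom_gb is an assumed allowance for runtime buffers, not a measured value.
    """
    usable = budget_gb - headroom_gb
    fitting = {quant: size for quant, size in FILES_GB.items() if size <= usable}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


for budget in (5.0, 6.0, 8.0):
    print(budget, "GB ->", pick_quant(budget))
```

Larger files generally keep more precision, so choosing the biggest one that fits is a reasonable starting point; it is worth comparing output quality and speed on the target device before settling on a file.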
llama.cpp
LLM inference in C/C++ — https://github.com/ggml-org/llama.cpp
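The mmproj files in the table hold the multimodal projector that llama.cpp loads alongside the main model file for non-text input. As a rough sketch, the helper below assembles a `llama-server` command line that pairs the two. It assumes a llama.cpp build with `llama-server` on the PATH and support for this model; the helper name `llama_server_args` is hypothetical, and option behavior can differ between llama.cpp versions.

```python
def llama_server_args(model_path, mmproj_path=None, ctx_size=None):
    """Build an argument list for llama.cpp's llama-server (nothing is executed here)."""
    args = ["llama-server", "-m", model_path]
    if mmproj_path is not None:
        # Multimodal projector file, needed for image or audio input.
        args += ["--mmproj", mmproj_path]
    if ctx_size is not None:
        # Context size in tokens; larger values use more memory.
        args += ["-c", str(ctx_size)]
    return args


cmd = llama_server_args(
    "gemma-4-E4B-it-qat.Q4_0.gguf",
    mmproj_path="gemma-4-E4B-it-qat.mmproj-f16.gguf",
    ctx_size=8192,
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd)` once both files have been downloaded to the working directory.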
Model tree for prithivMLmods/gemma-4-E4B-it-qat-GGUF
Base model
google/gemma-4-E4B