gemma-4-12B — GGUF Quantizations

Quantized GGUF versions of google/gemma-4-12B.

These files work with llama.cpp, Ollama, LM Studio, Jan, and any other GGUF-compatible runtime.

Quantized by Dhptl on June 09, 2026

📦 Available Files

Filename	Size	Quant	Use Case
`gemma-4-12B-IQ4_XS.gguf`	6.23 GB	`IQ4_XS`	Minimal RAM usage
`gemma-4-12B-Q4_K_M.gguf`	6.87 GB	`Q4_K_M` ✅ Recommended	General use, everyday inference
`gemma-4-12B-Q5_K_M.gguf`	7.96 GB	`Q5_K_M`	When you want a bit more accuracy
`gemma-4-12B-Q8_0.gguf`	11.80 GB	`Q8_0`	High-quality inference, evaluation

Which file should I download?

If you have...	Download this
8 GB RAM	`IQ4_XS` — Smallest, runs on 8GB
10 GB RAM	`Q4_K_M` — Best choice ✅
12 GB RAM	`Q5_K_M` — Better quality
16 GB+ RAM	`Q8_0` — Near-original quality
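
The table above can be expressed as a small lookup. This is only a sketch: the thresholds are copied from the table, while `pick_quant` and `quant_filename` are hypothetical helper names, not part of any tool mentioned here.

```python
from typing import Optional

# Thresholds copied from the table above: (minimum RAM in GB, quant name),
# ordered from largest to smallest so the first match is the best fit.
RAM_TO_QUANT = [
    (16, "Q8_0"),
    (12, "Q5_K_M"),
    (10, "Q4_K_M"),
    (8, "IQ4_XS"),
]


def pick_quant(ram_gb: float) -> Optional[str]:
    """Return the quant the table suggests for `ram_gb`, or None below 8 GB."""
    for min_ram, quant in RAM_TO_QUANT:
        if ram_gb >= min_ram:
            return quant
    return None


def quant_filename(quant: str) -> str:
    """Build the filename used in the Available Files table."""
    return f"gemma-4-12B-{quant}.gguf"


choice = pick_quant(12)
if choice is not None:
    print(quant_filename(choice))
```

The lookup only reflects total RAM. Long contexts and other running applications need extra headroom, so treat the result as a starting point.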

🧠 Original Model Quality Benchmarks

Results are for Gemma 4 12B (Base), as reported by Google, and apply to the original BF16 model. As a rough guide, GGUF quantization preserves ~98–99% of quality for Q5/Q8 and ~96–97% for Q4 variants.

Benchmark	Category	Score
MMLU Pro	Text	77.2%
GPQA Diamond	Science	78.8%
AIME 2026 (no tools)	Math	77.5%
LiveCodeBench v6	Coding	72.0%
BigBench Extra Hard	Reasoning	53.0%
MMMLU	Multilingual	83.4%
MMMU Pro	Vision	69.1%
MRCR v2 8-needle 128k	Long Context	43.4%

📊 Speed Benchmarks

Tested on: Intel(R) Core(TM) Ultra 7 258V | 31.5 GB RAM | Intel Arc 140V (Vulkan)

Model	Size	Generation	Prompt Processing
`gemma-4-12B-IQ4_XS.gguf`	6.23 GB	8.1 tok/s	249.7 tok/s
`gemma-4-12B-Q4_K_M.gguf`	6.87 GB	10.9 tok/s	232.2 tok/s
`gemma-4-12B-Q5_K_M.gguf`	7.96 GB	9.6 tok/s	244.9 tok/s
`gemma-4-12B-Q8_0.gguf`	11.80 GB	6.8 tok/s	267.2 tok/s

Generation speed = how fast the model outputs tokens (higher = better). Prompt processing = how fast it reads your input (higher = better). Results vary by hardware and system load.
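
To see what these figures mean for a whole request, the two speeds can be combined into a rough latency estimate. The sketch below uses the numbers from the table above. The 1,000-token prompt and 256-token reply are illustrative assumptions, and the simple additive model ignores model load time and any slowdown at longer contexts.

```python
# Speeds copied from the table above: (generation tok/s, prompt processing tok/s).
SPEEDS = {
    "IQ4_XS": (8.1, 249.7),
    "Q4_K_M": (10.9, 232.2),
    "Q5_K_M": (9.6, 244.9),
    "Q8_0": (6.8, 267.2),
}


def estimate_seconds(quant: str, prompt_tokens: int, output_tokens: int) -> float:
    """Approximate request time: time to read the prompt plus time to generate."""
    gen_tps, prompt_tps = SPEEDS[quant]
    return prompt_tokens / prompt_tps + output_tokens / gen_tps


# Assumed workload: 1,000 prompt tokens and 256 generated tokens.
for quant in SPEEDS:
    print(f"{quant}: {estimate_seconds(quant, 1000, 256):.1f} s")
```

For short prompts the generation term dominates, so the quant with the highest generation speed tends to finish first even when its prompt processing is slightly slower.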

🚀 How to Use

With Ollama

ollama run Dhptl/gemma-4-12b

With llama.cpp

./llama-cli -m gemma-4-12B-Q4_K_M.gguf -p "Your prompt here" -n 512

With LM Studio

1. Open LM Studio
2. Search for Dhptl/gemma-4-12B-GGUF
3. Download your preferred quant
4. Load the model and start chatting

With Python (llama-cpp-python)

from llama_cpp import Llama

llm = Llama(
    model_path="./gemma-4-12B-Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,  # -1 = offload all layers to GPU
)

output = llm("Explain quantum computing in simple terms:", max_tokens=256)
print(output["choices"][0]["text"])
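
At the generation speeds listed above, a 256-token reply can take a while, so printing tokens as they arrive is often more pleasant. The sketch below relies on the `stream=True` option of llama-cpp-python's completion call. `stream_text` is a hypothetical helper name, and `llm` is assumed to be a `Llama` instance created as shown above.

```python
from typing import Any, Callable, Iterator


def stream_text(llm: Callable[..., Any], prompt: str, max_tokens: int = 256) -> Iterator[str]:
    """Yield generated text piece by piece instead of waiting for the full reply."""
    # With stream=True the call returns an iterator of chunks. Each chunk has
    # the same layout as the non-streaming response, holding one text fragment.
    for chunk in llm(prompt, max_tokens=max_tokens, stream=True):
        yield chunk["choices"][0]["text"]
```

Usage, given the `llm` object from the example above: `for piece in stream_text(llm, "Explain quantum computing in simple terms:"): print(piece, end="", flush=True)`.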

🔧 Quantization Details

Format	Bits	Description
`Q4_K_M`	4-bit	K-quantization, medium; best size/quality balance
`Q5_K_M`	5-bit	K-quantization, medium; higher quality
`Q8_0`	8-bit	Near-lossless; largest GGUF file
`IQ4_XS`	~4-bit	I-quant (non-linear 4-bit format, usually built with an importance matrix); smallest with good quality

Quantization was done using llama.cpp.
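
The "Bits" column is nominal. A rough effective figure can be derived from the file sizes listed under Available Files. This is a back-of-the-envelope sketch under two loud assumptions: the parameter count is taken as exactly 12e9 (the card only says "~12 Billion"), and sizes are read as decimal gigabytes. The results are therefore approximate, and they also include any tensors kept at higher precision plus file metadata.

```python
# File sizes copied from the Available Files table, in GB.
FILE_SIZES_GB = {
    "IQ4_XS": 6.23,
    "Q4_K_M": 6.87,
    "Q5_K_M": 7.96,
    "Q8_0": 11.80,
}

ASSUMED_PARAMS = 12e9  # assumption: "~12 Billion" treated as exactly 12e9
BYTES_PER_GB = 1e9     # assumption: decimal gigabytes, not GiB


def bits_per_weight(size_gb: float) -> float:
    """Average bits stored per parameter for a file of `size_gb` gigabytes."""
    return size_gb * BYTES_PER_GB * 8 / ASSUMED_PARAMS


for quant, size_gb in FILE_SIZES_GB.items():
    print(f"{quant}: ~{bits_per_weight(size_gb):.2f} bits/weight")
```

If the sizes are actually in GiB, or the true parameter count differs from 12e9, every figure shifts by the same factor, so the relative ordering of the formats is more trustworthy than the absolute values.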

ℹ️ About the Original Model

Original Model: google/gemma-4-12B
Architecture: Gemma 4 Unified (multimodal — text + vision capable)
Parameters: ~12 Billion
Context Length: 128K tokens
License: Gemma Terms of Use

💬 Feedback

If you find issues or have questions, open a discussion.

If these quants are useful to you, please ⭐ the repo!

URL: https://huggingface.co/Dhptl/gemma-4-12B-GGUF
