
URL: https://huggingface.co/Dhptl/gemma-4-12B-GGUF



gemma-4-12B: GGUF Quantizations

Quantized GGUF versions of google/gemma-4-12B.

These files work with llama.cpp, Ollama, LM Studio, Jan, and any other GGUF-compatible runtime.

Quantized by Dhptl on June 09, 2026


📦 Available Files

| Filename | Size | Quant | Use Case |
|---|---|---|---|
| gemma-4-12B-IQ4_XS.gguf | 6.23 GB | IQ4_XS | Minimal RAM usage |
| gemma-4-12B-Q4_K_M.gguf | 6.87 GB | Q4_K_M ✅ Recommended | General use, everyday inference |
| gemma-4-12B-Q5_K_M.gguf | 7.96 GB | Q5_K_M | When you want a bit more accuracy |
| gemma-4-12B-Q8_0.gguf | 11.80 GB | Q8_0 | High-quality inference, evaluation |
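
The files listed above can also be fetched from a script. The following is a minimal sketch, assuming the `huggingface_hub` package is installed; the helper names `gguf_filename` and `download_quant` are hypothetical, while the repo ID and file names are taken from this card.

```python
# Sketch: fetch one quantized file from the Hugging Face Hub.
# Assumes `pip install huggingface_hub`. Helper names are illustrative only.

REPO_ID = "Dhptl/gemma-4-12B-GGUF"
QUANTS = ("IQ4_XS", "Q4_K_M", "Q5_K_M", "Q8_0")


def gguf_filename(quant: str) -> str:
    """Build the file name for a quant level listed in the table above."""
    if quant not in QUANTS:
        raise ValueError(f"unknown quant {quant!r}; expected one of {QUANTS}")
    return f"gemma-4-12B-{quant}.gguf"


def download_quant(quant: str = "Q4_K_M") -> str:
    """Download the file into the local Hub cache and return its local path."""
    # Imported lazily so gguf_filename() works without the package installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=REPO_ID, filename=gguf_filename(quant))
```

Calling `download_quant("Q4_K_M")` should return a local path that can be passed to llama.cpp or llama-cpp-python; note that it downloads several gigabytes on first use.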

Which file should I download?

| If you have... | Download this |
|---|---|
| 8 GB RAM | IQ4_XS: smallest, runs on 8 GB |
| 10 GB RAM | Q4_K_M: best choice ✅ |
| 12 GB RAM | Q5_K_M: better quality |
| 16 GB+ RAM | Q8_0: near-original quality |
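
The table above can be expressed as a small lookup. This is only a sketch: it assumes each row is a minimum RAM threshold (the card does not say how to treat values between rows), and `pick_quant` is a hypothetical helper name.

```python
# RAM thresholds (GB) and suggestions copied from the table above, largest first.
# Assumption: each row is treated as a minimum amount of RAM.
RECOMMENDATIONS = [
    (16, "Q8_0"),
    (12, "Q5_K_M"),
    (10, "Q4_K_M"),
    (8, "IQ4_XS"),
]


def pick_quant(ram_gb: float):
    """Return the largest quant the table suggests for `ram_gb`, or None below 8 GB."""
    for min_ram, quant in RECOMMENDATIONS:
        if ram_gb >= min_ram:
            return quant
    return None


for ram in (8, 10, 12, 16, 32):
    print(f"{ram} GB RAM -> {pick_quant(ram)}")
```

Available RAM is only a rough guide, since the context size and other running applications also consume memory.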

🧠 Original Model Quality Benchmarks

Results for Gemma 4 12B (Base), as reported by Google. These benchmarks apply to the original BF16 model, not to the quantized files. GGUF quantization preserves roughly 98-99% of quality for Q5/Q8 and 96-97% for Q4 variants.

| Benchmark | Category | Score |
|---|---|---|
| MMLU Pro | Text | 77.2% |
| GPQA Diamond | Science | 78.8% |
| AIME 2026 (no tools) | Math | 77.5% |
| LiveCodeBench v6 | Coding | 72.0% |
| BigBench Extra Hard | Reasoning | 53.0% |
| MMMLU | Multilingual | 83.4% |
| MMMU Pro | Vision | 69.1% |
| MRCR v2 8-needle 128k | Long Context | 43.4% |

📊 Speed Benchmarks

Tested on: Intel(R) Core(TM) Ultra 7 258V | 31.5GB RAM | Intel Arc 140V (Vulkan)

| Model | Size | Generation | Prompt Processing |
|---|---|---|---|
| gemma-4-12B-IQ4_XS.gguf | 6.23 GB | 8.1 tok/s | 249.7 tok/s |
| gemma-4-12B-Q4_K_M.gguf | 6.87 GB | 10.9 tok/s | 232.2 tok/s |
| gemma-4-12B-Q5_K_M.gguf | 7.96 GB | 9.6 tok/s | 244.9 tok/s |
| gemma-4-12B-Q8_0.gguf | 11.80 GB | 6.8 tok/s | 267.2 tok/s |

Generation speed = how fast the model outputs tokens (higher = better). Prompt processing = how fast it reads your input (higher = better). Results vary by hardware and system load.
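
To make these two rates concrete, the sketch below turns them into a rough response-time estimate: prompt tokens divided by the prompt-processing rate, plus output tokens divided by the generation rate. It assumes both rates stay constant regardless of context length, which is a simplification; `estimate_seconds` is a hypothetical helper, and the numbers only describe the test machine listed above.

```python
# (generation tok/s, prompt-processing tok/s) copied from the speed table above.
SPEEDS = {
    "IQ4_XS": (8.1, 249.7),
    "Q4_K_M": (10.9, 232.2),
    "Q5_K_M": (9.6, 244.9),
    "Q8_0": (6.8, 267.2),
}


def estimate_seconds(quant: str, prompt_tokens: int, output_tokens: int) -> float:
    """Rough wall-clock estimate: time to read the prompt plus time to generate."""
    gen_tps, prompt_tps = SPEEDS[quant]
    return prompt_tokens / prompt_tps + output_tokens / gen_tps


# Example workload: a 1000-token prompt followed by a 256-token answer.
for quant in SPEEDS:
    secs = estimate_seconds(quant, prompt_tokens=1000, output_tokens=256)
    print(f"{quant}: ~{secs:.0f} s")
```

For short prompts the generation rate dominates the total, so the small differences in prompt-processing speed between quants matter much less than the generation column.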


🚀 How to Use

With Ollama

ollama run Dhptl/gemma-4-12b

With llama.cpp

./llama-cli -m gemma-4-12B-Q4_K_M.gguf -p "Your prompt here" -n 512

With LM Studio

  1. Open LM Studio
  2. Search for Dhptl/gemma-4-12B
  3. Download your preferred quant
  4. Load and chat

With Python (llama-cpp-python)

from llama_cpp import Llama

llm = Llama(
    model_path="./gemma-4-12B-Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,  # -1 = offload all layers to GPU
)

output = llm("Explain quantum computing in simple terms:", max_tokens=256)
print(output["choices"][0]["text"])
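
At the generation speeds reported on this card, waiting for a full completion can take a while, so printing tokens as they arrive may be preferable. The sketch below assumes llama-cpp-python's `stream=True` option, which yields chunks shaped like the non-streaming response; `iter_text` and `stream_completion` are hypothetical helper names.

```python
from typing import Iterable, Iterator


def iter_text(chunks: Iterable[dict]) -> Iterator[str]:
    """Yield the text fragment carried by each streamed completion chunk."""
    for chunk in chunks:
        yield chunk["choices"][0]["text"]


def stream_completion(model_path: str, prompt: str, max_tokens: int = 256) -> str:
    """Print text as it is generated and return the full completion."""
    # Imported lazily: only needed when a model is actually loaded.
    from llama_cpp import Llama

    llm = Llama(model_path=model_path, n_ctx=4096, n_gpu_layers=-1)
    pieces = []
    for text in iter_text(llm(prompt, max_tokens=max_tokens, stream=True)):
        print(text, end="", flush=True)
        pieces.append(text)
    print()
    return "".join(pieces)
```

A call such as `stream_completion("./gemma-4-12B-Q4_K_M.gguf", "Explain quantum computing in simple terms:")` would load the model and stream its answer to the terminal.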

🔧 Quantization Details

| Format | Bits | Description |
|---|---|---|
| Q4_K_M | 4-bit | K-quantization, medium; best size/quality balance |
| Q5_K_M | 5-bit | K-quantization, medium; higher quality |
| Q8_0 | 8-bit | Near-lossless; largest GGUF file |
| IQ4_XS | ~4-bit | Importance-matrix quant; smallest with good quality |

Quantization was done using llama.cpp.
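
The bit widths above are nominal. A rough effective figure can be derived from the file sizes listed on this card, since quantized files also store scale factors and metadata. The sketch below assumes exactly 12 billion parameters (the card only says "~12 Billion") and 1 GB = 10^9 bytes, so treat the results as approximate; `bits_per_weight` is a hypothetical helper.

```python
PARAMS = 12e9  # assumption: "~12 Billion" on this card, exact count not stated

# File sizes in GB, copied from the Available Files table.
SIZES_GB = {"IQ4_XS": 6.23, "Q4_K_M": 6.87, "Q5_K_M": 7.96, "Q8_0": 11.80}


def bits_per_weight(size_gb: float, params: float = PARAMS) -> float:
    """Average bits stored per parameter, assuming 1 GB = 10**9 bytes."""
    return size_gb * 1e9 * 8 / params


for quant, size in SIZES_GB.items():
    print(f"{quant}: ~{bits_per_weight(size):.2f} bits/weight")
```

If the listed sizes are actually in GiB, or the true parameter count differs from 12 billion, the computed values shift accordingly.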


โ„น๏ธ About the Original Model

  • Original Model: google/gemma-4-12B
  • Architecture: Gemma 4 Unified (multimodal: text + vision capable)
  • Parameters: ~12 Billion
  • Context Length: 128K tokens
  • License: Gemma Terms of Use

💬 Feedback

If you find issues or have questions, open a discussion.

If these quants are useful to you, please ⭐ the repo!
