Gemma-4-e2bxOpus-4.7-turbo-GGUF

A distilled code-focused variant of Gemma-4 e2b, optimized for efficient local inference using GGUF format. This model is designed for coding assistance, reasoning, and structured generation tasks, with optional “thinking” mode enabled via chat templates.

The model was trained by fully fine-tuning a set of target layers, aiming to keep the base model's performance while adapting it toward short, reasonable reasoning.

Example usage:

For text-only use: llama-cli -hf nphearum/Gemma-4-e2bxOpus-4.7-turbo-GGUF --jinja
For multimodal use: llama-mtmd-cli -hf nphearum/Gemma-4-e2bxOpus-4.7-turbo-GGUF --jinja

📦 Available Model Files

gemma-4-e2b-it.Q8_0.gguf — Quantized model (Q8_0 for high quality)
gemma-4-e2b-it.BF16-mmproj.gguf — Multimodal projection (required for full functionality)

🚀 Features

Strong code generation & reasoning (CodeX-style distillation)
Long context support (tested up to 131k tokens)
Optimized for llama.cpp
Supports structured chat templates (Jinja-based)
Optional “thinking mode” for better reasoning traces

🖥️ Running with llama.cpp

Make sure you’re using a recent build of llama.cpp with:

Flash Attention enabled
Jinja/chat template support compiled

Start Server

llama-server \
 -m gemma-4-e2b-it.Q8_0.gguf \
 --port 53281 \
 -c 131072 \
 --parallel 1 \
 --flash-attn on \
 --no-context-shift \
 -ngl -1 \
 --jinja \
 --chat-template-kwargs "{\"enable_thinking\": true}" \
 --mmproj gemma-4-e2b-it.BF16-mmproj.gguf

Key Flags Explained

-c 131072 → Enables long context (131k tokens)
--flash-attn on → Faster attention (requires compatible GPU)
-ngl -1 → Offload all layers to GPU
--jinja → Enables chat template rendering
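--no-context-shift → Disables context shifting (older tokens are not discarded when the context fills up)
--chat-template-kwargs → Passes enable_thinking=true to the chat template, which activates thinking mode
--mmproj → Loads the multimodal projection file (required for multimodal input)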
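
The quote escaping in --chat-template-kwargs is easy to get wrong in a shell. As a minimal sketch, the same launch command can be assembled in Python, where json.dumps produces the JSON argument. The helper name build_server_args and its defaults are illustrative assumptions; the flags are the ones from the command above.

```python
import json
import shlex

def build_server_args(model, mmproj, ctx=131072, port=53281, thinking=True):
    """Assemble the llama-server argument list from this README (hypothetical helper)."""
    return [
        "llama-server",
        "-m", model,
        "--port", str(port),
        "-c", str(ctx),
        "--parallel", "1",
        "--flash-attn", "on",
        "--no-context-shift",
        "-ngl", "-1",
        "--jinja",
        # json.dumps emits valid JSON, so no manual quote escaping is needed
        "--chat-template-kwargs", json.dumps({"enable_thinking": thinking}),
        "--mmproj", mmproj,
    ]

args = build_server_args("gemma-4-e2b-it.Q8_0.gguf", "gemma-4-e2b-it.BF16-mmproj.gguf")
print(shlex.join(args))  # shell-safe, copy-pasteable form of the command
# To launch for real: subprocess.run(args, check=True)
```

Passing thinking=False or a smaller ctx gives the faster or lower-VRAM variants without editing the command by hand.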

Test Request

curl http://localhost:53281/v1/chat/completions \
 -H "Content-Type: application/json" \
 -d '{
 "messages": [
 {"role": "user", "content": "Write a Python function to reverse a linked list"}
 ]
 }'
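
The same request can be sent from Python using only the standard library. This is a sketch that assumes the server from the previous section is listening on port 53281; the function names build_chat_request and ask are made up for illustration.

```python
import json
import urllib.request

BASE_URL = "http://localhost:53281"  # port from the llama-server command above

def build_chat_request(prompt, base_url=BASE_URL):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    body = {"messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]

# Usage (requires the server to be running):
# print(ask("Write a Python function to reverse a linked list"))
```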

🧠 Notes on Thinking Mode

When enable_thinking=true, the model may:

Produce intermediate reasoning steps
Improve structured problem solving
Slightly increase latency

Disable it if you need faster responses.
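
Thinking can also be toggled per request rather than per server launch, since llama-server accepts a chat_template_kwargs field in the request body. The sketch below assumes your build honors that field and returns the reasoning trace in a separate reasoning_content field of the message (this depends on the build and its reasoning-format setting); build_body and split_reply are hypothetical helper names.

```python
def build_body(prompt, thinking):
    """Request body that overrides enable_thinking for this request only."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": thinking},
    }

def split_reply(response):
    """Return (reasoning, answer); reasoning is empty if the server sent none."""
    message = response["choices"][0]["message"]
    return message.get("reasoning_content") or "", message.get("content") or ""

# Mock response in the assumed shape, used here instead of a live server call
mock = {"choices": [{"message": {"reasoning_content": "Walk the list once.",
                                 "content": "def reverse(head): ..."}}]}
reasoning, answer = split_reply(mock)
print(len(reasoning) > 0, answer)
```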

🦙 Running with Ollama

⚠️ Important (vision models): Ollama currently does not support separate mmproj files, so the Modelfile below loads only the text model.

Create a Modelfile:

FROM ./gemma-4-e2b-it.Q8_0.gguf

PARAMETER num_ctx 131072
PARAMETER num_gpu -1
PARAMETER stop "<end_of_turn>"

TEMPLATE """{{ if .System }}<start_of_turn>system
{{ .System }}<end_of_turn>
{{ end }}{{ if .Prompt }}<start_of_turn>user
{{ .Prompt }}<end_of_turn>
<start_of_turn>model
{{ end }}"""

# Optional: enable reasoning-style outputs
SYSTEM "You are a highly capable coding assistant with strong reasoning ability."
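
To make the TEMPLATE above easier to read, here is a small Python mirror of what it renders. It is only an illustration of the turn layout written in the Modelfile, not Ollama's own template engine, and render_prompt is a hypothetical name.

```python
def render_prompt(prompt, system=""):
    """Mirror the Modelfile TEMPLATE: optional system turn, user turn, open model turn."""
    parts = []
    if system:
        parts.append(f"<start_of_turn>system\n{system}<end_of_turn>\n")
    if prompt:
        parts.append(
            f"<start_of_turn>user\n{prompt}<end_of_turn>\n<start_of_turn>model\n"
        )
    return "".join(parts)

print(render_prompt("Reverse a linked list", system="You are a coding assistant."))
```

The rendered text ends with an open model turn, which is why the Modelfile sets <end_of_turn> as the stop sequence.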

Build & Run

ollama create gemma-4-opus -f Modelfile
ollama run gemma-4-opus
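
Once the model is created, it can also be queried over Ollama's local REST API. A minimal sketch, assuming Ollama is serving on its default port 11434 and the model was created under the name gemma-4-opus as above; the helper names are illustrative.

```python
import json
import urllib.request

def build_ollama_request(prompt, model="gemma-4-opus", host="http://localhost:11434"):
    """Build (but do not send) a non-streaming request for Ollama's /api/chat."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # return one JSON object instead of a stream of chunks
    }
    return urllib.request.Request(
        f"{host}/api/chat",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt):
    """Send the request and return the reply text."""
    with urllib.request.urlopen(build_ollama_request(prompt)) as resp:
        return json.load(resp)["message"]["content"]

# Usage (requires `ollama serve` to be running):
# print(chat("Write a Python function to reverse a linked list"))
```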

⚙️ Recommended Settings

Use Case	Context	GPU Layers	Notes
Coding assistant	32k–64k	Full (-1)	Best balance
Long reasoning	131k	Full	Needs high VRAM
Low VRAM setup	8k–16k	Partial	Disable flash-attn

⚠️ Limitations

Requires significant VRAM for full 131k context
Thinking mode increases latency
Multimodal projection file must match model variant
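
To see why the full 131k context needs a lot of VRAM, a rough upper bound on KV cache size can be computed from the attention shape. The layer count, KV head count, and head dimension below are placeholder values, not this model's actual configuration (read the real ones from the GGUF metadata). The formula also ignores any sliding-window attention layers and KV cache quantization, both of which would lower the real figure.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Upper-bound KV cache size for full attention in every layer."""
    # The leading 2 accounts for one K tensor plus one V tensor per layer;
    # bytes_per_elem=2 corresponds to a 16-bit cache.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

# Placeholder architecture values, chosen only to show how the cost scales
for n_ctx in (8192, 32768, 131072):
    size = kv_cache_bytes(n_layers=30, n_kv_heads=2, head_dim=256, n_ctx=n_ctx)
    print(f"{n_ctx:>7} tokens: {size / 2**30:.2f} GiB")
```

The cost grows linearly with context length, so dropping from 131k to 32k tokens cuts this estimate to a quarter.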

📜 License

Follow the original Gemma license and any additional terms from this distillation.

🙌 Credits

Base model: Google Gemma family
Distillation: Code-focused adaptation
Runtime: llama.cpp ecosystem

Format: GGUF
Model size: 5B params
Architecture: gemma4

Model tree for nphearum/Gemma-4-e2bxOpus-4.7-turbo-GGUF

Base model: google/gemma-4-E2B
Finetuned: google/gemma-4-E2B-it
Quantized: this model

URL: https://huggingface.co/nphearum/Gemma-4-e2bxOpus-4.7-turbo-GGUF
