Gemma-4-e2bxOpus-4.7-turbo-GGUF
A distilled, code-focused variant of Gemma-4 e2b, optimized for efficient local inference in the GGUF format. The model is designed for coding assistance, reasoning, and structured generation tasks, with an optional “thinking” mode enabled via chat templates.
The model was trained by fully fine-tuning the target layers of a performance-oriented base, adapted for short, reasonable responses.
Example usage:
- For text-only LLMs:
llama-cli -hf nphearum/Gemma-4-e2bxOpus-4.7-turbo-GGUF --jinja
- For multimodal models:
llama-mtmd-cli -hf nphearum/Gemma-4-e2bxOpus-4.7-turbo-GGUF --jinja
📦 Available Model Files
- gemma-4-e2b-it.Q8_0.gguf: Quantized model (Q8_0 for high quality)
- gemma-4-e2b-it.BF16-mmproj.gguf: Multimodal projection (required for full functionality)
🚀 Features
- Strong code generation & reasoning (CodeX-style distillation)
- Long context support (tested up to 131k tokens)
- Optimized for llama.cpp
- Supports structured chat templates (Jinja-based)
- Optional “thinking mode” for better reasoning traces
🖥️ Running with llama.cpp
Make sure you’re using a recent build of llama.cpp with:
- Flash Attention enabled
- Jinja/chat template support compiled
Start Server
llama-server \
-m gemma-4-e2b-it.Q8_0.gguf \
--port 53281 \
-c 131072 \
--parallel 1 \
--flash-attn on \
--no-context-shift \
-ngl -1 \
--jinja \
--chat-template-kwargs "{\"enable_thinking\": true}" \
--mmproj gemma-4-e2b-it.BF16-mmproj.gguf
Key Flags Explained
- -c 131072 → Enables long context (131k tokens)
- --flash-attn on → Faster attention (requires compatible GPU)
- -ngl -1 → Offload all layers to GPU
- --jinja → Enables chat template rendering
- --chat-template-kwargs → Activates thinking mode
- --mmproj → Required for multimodal projection
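The escaped JSON passed to --chat-template-kwargs is easy to get wrong in a shell. As a minimal sketch, the same invocation can be assembled as an argument list in Python, letting json.dumps produce the JSON value. The helper name build_server_args and its parameters are hypothetical; the flags are exactly the ones from the command above.

```python
import json


def build_server_args(model, mmproj=None, ctx=131072, port=53281, thinking=True):
    """Assemble the llama-server argument list shown above (hypothetical helper)."""
    args = [
        "llama-server",
        "-m", model,
        "--port", str(port),
        "-c", str(ctx),
        "--parallel", "1",
        "--flash-attn", "on",
        "--no-context-shift",
        "-ngl", "-1",
        "--jinja",
        # json.dumps yields {"enable_thinking": true} without manual escaping
        "--chat-template-kwargs", json.dumps({"enable_thinking": thinking}),
    ]
    # The projection file is only needed for multimodal use
    if mmproj is not None:
        args += ["--mmproj", mmproj]
    return args
```

The resulting list can be handed to subprocess.Popen directly, which avoids shell quoting altogether.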
Test Request
curl http://localhost:53281/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "Write a Python function to reverse a linked list"}
]
}'
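The same request can be sent from Python using only the standard library. This is a sketch that assumes the server from the previous step is listening on port 53281; the function names build_payload and chat are illustrative, not part of llama.cpp.

```python
import json
import urllib.request

URL = "http://localhost:53281/v1/chat/completions"


def build_payload(prompt):
    """Build the same JSON body as the curl example."""
    return {"messages": [{"role": "user", "content": prompt}]}


def chat(prompt, url=URL, timeout=600):
    """POST the prompt and return the assistant's reply text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    # OpenAI-compatible responses carry the reply in choices[0].message.content
    return body["choices"][0]["message"]["content"]


# Usage (requires the server to be running):
# print(chat("Write a Python function to reverse a linked list"))
```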
🧠 Notes on Thinking Mode
When enable_thinking=true, the model may:
- Produce intermediate reasoning steps
- Improve structured problem solving
- Slightly increase latency
Disable it if you need faster responses.
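If you want to toggle thinking per request rather than at server start, or keep the reasoning trace apart from the final answer, something like the following may work. Two things here are assumptions to verify against your llama.cpp build: that the server accepts a chat_template_kwargs field in the request body, and that reasoning text is returned in a separate reasoning_content field of the message. Both helper names are hypothetical.

```python
def build_thinking_payload(prompt, enable_thinking=True):
    """Request body that toggles thinking for a single call."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        # Assumption: the server honors per-request chat_template_kwargs
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


def split_reasoning(message):
    """Return (reasoning, answer) from a chat-completions message dict."""
    # Assumption: reasoning, when present, arrives as "reasoning_content"
    reasoning = message.get("reasoning_content") or ""
    answer = message.get("content") or ""
    return reasoning, answer
```

If the server ignores the extra field, the request still succeeds and the setting passed with --chat-template-kwargs at startup applies.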
🦙 Running with Ollama
⚠️ Important: Ollama currently does not support separate mmproj files for vision models.
Create a Modelfile:
FROM ./gemma-4-e2b-it.Q8_0.gguf
PARAMETER num_ctx 131072
PARAMETER num_gpu -1
PARAMETER stop "<end_of_turn>"
TEMPLATE """{{ if .System }}<start_of_turn>system
{{ .System }}<end_of_turn>
{{ end }}{{ if .Prompt }}<start_of_turn>user
{{ .Prompt }}<end_of_turn>
<start_of_turn>model
{{ end }}"""
# Optional: enable reasoning-style outputs
SYSTEM "You are a highly capable coding assistant with strong reasoning ability."
Build & Run
ollama create gemma-4-opus -f Modelfile
ollama run gemma-4-opus
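Once the model is created, it can also be queried over Ollama's local HTTP API. The sketch below assumes Ollama is running on its default address (localhost:11434) and that the model was created under the name gemma-4-opus as above; the helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local address


def build_ollama_payload(prompt, model="gemma-4-opus"):
    """Build a non-streaming chat request body for Ollama."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a stream
    }


def ollama_chat(prompt, model="gemma-4-opus", timeout=600):
    """POST the prompt to Ollama and return the reply text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_ollama_payload(prompt, model)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["message"]["content"]


# Usage (requires `ollama serve` and the model created above):
# print(ollama_chat("Write a Python function to reverse a linked list"))
```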
⚙️ Recommended Settings
| Use Case | Context | GPU Layers | Notes |
|---|---|---|---|
| Coding assistant | 32k–64k | Full (-1) | Best balance |
| Long reasoning | 131k | Full | Needs high VRAM |
| Low VRAM setup | 8k–16k | Partial | Disable flash-attn |
⚠️ Limitations
- Requires significant VRAM for full 131k context
- Thinking mode increases latency
- Multimodal projection file must match model variant
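To get a feel for why the full 131k context needs a lot of VRAM: in a plain transformer, the KV cache grows linearly with context length. The sketch below is a rough upper bound only. The layer count, KV-head count, and head dimension in the usage note are placeholder values, not this model's actual configuration, and architectures with sliding-window layers or a quantized KV cache need less.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Upper-bound KV cache size for full attention at a given context length."""
    # Factor 2 covers keys and values, stored per layer, KV head, and position
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


def to_gib(n_bytes):
    """Convert bytes to GiB."""
    return n_bytes / 2**30
```

With the placeholder values n_layers=30, n_kv_heads=4, head_dim=256 and a 16-bit cache, to_gib(kv_cache_bytes(30, 4, 256, 131072)) gives 15.0, while the same shape at a 16384-token context gives 1.875. Model weights come on top of that.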
📜 License
Follow the original Gemma license and any additional terms from this distillation.
🙌 Credits
- Base model: Google Gemma family
- Distillation: Code-focused adaptation
- Runtime: llama.cpp ecosystem
Model details: GGUF format, 5B params, gemma4 architecture.
