URL: https://huggingface.co/nphearum/Gemma-4-e2bxOpus-4.7-turbo-GGUF

Gemma-4-e2bxOpus-4.7-turbo-GGUF

A distilled code-focused variant of Gemma-4 e2b, optimized for efficient local inference using GGUF format. This model is designed for coding assistance, reasoning, and structured generation tasks, with optional “thinking” mode enabled via chat templates.

The model was trained by fully fine-tuning selected target layers on top of a performant base, and adapted to keep its reasoning short and sensible.


Example usage:

  • For text only LLMs: llama-cli -hf nphearum/Gemma-4-e2bxOpus-4.7-turbo-GGUF --jinja
  • For multimodal models: llama-mtmd-cli -hf nphearum/Gemma-4-e2bxOpus-4.7-turbo-GGUF --jinja

📦 Available Model Files

  • gemma-4-e2b-it.Q8_0.gguf — Quantized model (Q8_0 for high quality)
  • gemma-4-e2b-it.BF16-mmproj.gguf — Multimodal projection (required for full functionality)
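
If you prefer to fetch the two files yourself rather than rely on `-hf`, the sketch below builds their direct-download URLs. It assumes the files sit at the repository root on the `main` revision and uses the Hub's usual `/resolve/<revision>/<file>` URL layout; `resolve_url` is a hypothetical helper name, not part of any library.

```python
from urllib.parse import quote

REPO_ID = "nphearum/Gemma-4-e2bxOpus-4.7-turbo-GGUF"
MODEL_FILES = [
    "gemma-4-e2b-it.Q8_0.gguf",         # quantized weights
    "gemma-4-e2b-it.BF16-mmproj.gguf",  # multimodal projection
]


def resolve_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build a direct-download URL for one file in a Hub repository.

    Assumption: the file lives at the repository root on `revision`.
    """
    return (
        f"https://huggingface.co/{repo_id}/resolve/"
        f"{quote(revision)}/{quote(filename)}"
    )


if __name__ == "__main__":
    for name in MODEL_FILES:
        print(resolve_url(REPO_ID, name))
```

The printed URLs can be passed to any downloader (for example `curl -L -O <url>`).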

🚀 Features

  • Strong code generation & reasoning (CodeX-style distillation)
  • Long context support (tested up to 131k tokens)
  • Optimized for llama.cpp
  • Supports structured chat templates (Jinja-based)
  • Optional “thinking mode” for better reasoning traces

🖥️ Running with llama.cpp

Make sure you’re using a recent build of llama.cpp with:

  • Flash Attention enabled
  • Jinja/chat template support compiled

Start Server

llama-server \
 -m gemma-4-e2b-it.Q8_0.gguf \
 --port 53281 \
 -c 131072 \
 --parallel 1 \
 --flash-attn on \
 --no-context-shift \
 -ngl -1 \
 --jinja \
 --chat-template-kwargs "{\"enable_thinking\": true}" \
 --mmproj gemma-4-e2b-it.BF16-mmproj.gguf

Key Flags Explained

  • -c 131072 → Enables long context (131k tokens)
  • --flash-attn on → Faster attention (requires compatible GPU)
  • -ngl -1 → Offload all layers to GPU
  • --jinja → Enables chat template rendering
  • --chat-template-kwargs → Passes enable_thinking=true to the chat template, which activates thinking mode
  • --mmproj → Required for multimodal projection
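
To make the mapping between settings and flags explicit, here is a minimal sketch that assembles the same `llama-server` command from a small config object. `ServerConfig` and `build_command` are hypothetical names introduced for illustration; the flags themselves are exactly the ones listed above.

```python
import json
import shlex
from dataclasses import dataclass


@dataclass
class ServerConfig:
    """Hypothetical container for the settings used in this README."""
    model: str = "gemma-4-e2b-it.Q8_0.gguf"
    mmproj: str = "gemma-4-e2b-it.BF16-mmproj.gguf"
    port: int = 53281
    ctx: int = 131072
    gpu_layers: int = -1
    thinking: bool = True


def build_command(cfg: ServerConfig) -> list[str]:
    """Return the llama-server argument list for the given config."""
    # json.dumps produces the same JSON the shell example escapes by hand.
    template_kwargs = json.dumps({"enable_thinking": cfg.thinking})
    return [
        "llama-server",
        "-m", cfg.model,
        "--port", str(cfg.port),
        "-c", str(cfg.ctx),
        "--parallel", "1",
        "--flash-attn", "on",
        "--no-context-shift",
        "-ngl", str(cfg.gpu_layers),
        "--jinja",
        "--chat-template-kwargs", template_kwargs,
        "--mmproj", cfg.mmproj,
    ]


if __name__ == "__main__":
    # shlex.join quotes the JSON argument safely for a POSIX shell.
    print(shlex.join(build_command(ServerConfig())))
```

Passing the list to `subprocess.run` avoids shell quoting entirely, which is the main reason to build the command as a list rather than a string.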

Test Request

curl http://localhost:53281/v1/chat/completions \
 -H "Content-Type: application/json" \
 -d '{
 "messages": [
 {"role": "user", "content": "Write a Python function to reverse a linked list"}
 ]
 }'
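
The same request can be sent from Python with only the standard library. This is a sketch under the assumption that the server answers in the OpenAI-compatible shape (`choices[0].message.content`); `build_chat_request`, `extract_reply`, and `ask` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:53281"  # port taken from the server command above


def build_chat_request(prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST request for /v1/chat/completions."""
    payload = {"messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response: dict) -> str:
    """Pull the assistant text out of an OpenAI-style response body."""
    return response["choices"][0]["message"]["content"]


def ask(prompt: str, base_url: str = BASE_URL, timeout: float = 600.0) -> str:
    """Send the prompt to a running server and return the reply text."""
    request = build_chat_request(prompt, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_reply(json.load(resp))


if __name__ == "__main__":
    print(ask("Write a Python function to reverse a linked list"))
```

The generous timeout is deliberate: with thinking mode on and a long context, a single reply can take a while.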

🧠 Notes on Thinking Mode

When enable_thinking=true, the model may:

  • Produce intermediate reasoning steps
  • Improve structured problem solving
  • Slightly increase latency

Disable it if you need faster responses.
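
Thinking mode can also be toggled per request instead of at server start. The sketch below assumes two things about your llama.cpp build that you should verify: that the server accepts a `chat_template_kwargs` field in the request body, and that it returns the reasoning trace in a separate `reasoning_content` field of the message. `build_payload` and `split_reply` are hypothetical helper names.

```python
def build_payload(prompt: str, enable_thinking: bool) -> dict:
    """Build a chat request body that sets thinking mode for this call only.

    Assumption: the server honors a per-request `chat_template_kwargs` field.
    """
    return {
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


def split_reply(response: dict) -> tuple[str, str]:
    """Return (reasoning, answer) from an OpenAI-style response body.

    Assumption: the reasoning trace, if any, arrives in `reasoning_content`.
    When that field is missing, the reasoning part is an empty string.
    """
    message = response["choices"][0]["message"]
    reasoning = message.get("reasoning_content") or ""
    answer = message.get("content") or ""
    return reasoning, answer


if __name__ == "__main__":
    print(build_payload("Explain binary search briefly", enable_thinking=False))
```

Keeping the reasoning separate from the answer makes it easy to log the trace while showing users only the final text.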


🦙 Running with Ollama

Important: ⚠️ Ollama currently does not support separate mmproj files for vision models, so the setup below is text-only.

Create a Modelfile:

FROM ./gemma-4-e2b-it.Q8_0.gguf

PARAMETER num_ctx 131072
PARAMETER num_gpu -1
PARAMETER stop "<end_of_turn>"

TEMPLATE """{{ if .System }}<start_of_turn>system
{{ .System }}<end_of_turn>
{{ end }}{{ if .Prompt }}<start_of_turn>user
{{ .Prompt }}<end_of_turn>
<start_of_turn>model
{{ end }}"""

# Optional: enable reasoning-style outputs
SYSTEM "You are a highly capable coding assistant with strong reasoning ability."

Build & Run

ollama create gemma-4-opus -f Modelfile
ollama run gemma-4-opus
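
Once the model is created, it can also be called over Ollama's local HTTP API. A minimal sketch, assuming Ollama is listening on its default address `http://localhost:11434` and exposes `/api/chat`; `build_ollama_request` and `extract_ollama_reply` are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: Ollama's default address
MODEL_NAME = "gemma-4-opus"            # name used in `ollama create` above


def build_ollama_request(prompt: str, model: str = MODEL_NAME,
                         base_url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a stream
    }
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_ollama_reply(response: dict) -> str:
    """Return the assistant text from a non-streaming /api/chat response."""
    return response["message"]["content"]


if __name__ == "__main__":
    request = build_ollama_request("Write a Python function to reverse a linked list")
    with urllib.request.urlopen(request, timeout=600) as resp:
        print(extract_ollama_reply(json.load(resp)))
```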

⚙️ Recommended Settings

Use Case           Context    GPU Layers   Notes
Coding assistant   32k–64k    Full (-1)    Best balance
Long reasoning     131k       Full         Needs high VRAM
Low VRAM setup     8k–16k     Partial      Disable flash-attn
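
The context ranges above can be turned into a small guard that keeps a requested `-c` value inside the recommended window. This is only a sketch: the preset names are hypothetical, and it assumes "k" means multiples of 1024 (so "131k" is 131072, matching the `-c 131072` used earlier).

```python
K = 1024  # assumption: "32k" means 32 * 1024 tokens

# (minimum, maximum) context per use case, taken from the table above.
CONTEXT_PRESETS = {
    "coding": (32 * K, 64 * K),
    "long_reasoning": (128 * K, 128 * K),  # 131072 tokens
    "low_vram": (8 * K, 16 * K),
}


def clamp_context(use_case: str, requested: int) -> int:
    """Clamp a requested context size to the recommended range."""
    low, high = CONTEXT_PRESETS[use_case]
    return max(low, min(requested, high))


if __name__ == "__main__":
    for name in CONTEXT_PRESETS:
        print(name, clamp_context(name, 100_000))
```

The number of GPU layers for the "Partial" row is left out on purpose, since it depends on your hardware.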

⚠️ Limitations

  • Requires significant VRAM for full 131k context
  • Thinking mode increases latency
  • Multimodal projection file must match model variant
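
To get a feel for why the full 131k context is VRAM-hungry, the sketch below applies the standard KV-cache size formula for plain full attention: 2 (keys and values) × layers × KV heads × head dimension × context length × bytes per element. The layer and head counts in the example are placeholders chosen for illustration, not this model's real architecture, and models that use sliding-window or shared KV layers need less than this estimate.

```python
def kv_cache_bytes(n_ctx: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Estimate KV-cache size in bytes for plain full attention.

    The factor 2 accounts for storing both keys and values.
    bytes_per_elem=2 corresponds to a 16-bit cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


if __name__ == "__main__":
    # Placeholder architecture values, NOT taken from this model.
    size = kv_cache_bytes(n_ctx=131072, n_layers=30, n_kv_heads=4, head_dim=256)
    print(f"{size / 2**30:.1f} GiB")
```

Because the estimate scales linearly with `n_ctx`, halving the context halves the cache, which is why the lower-context rows in the settings table suit smaller GPUs.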

📜 License

Follow the original Gemma license and any additional terms from this distillation.


🙌 Credits

  • Base model: Google Gemma family
  • Distillation: Code-focused adaptation
  • Runtime: llama.cpp ecosystem

Format: GGUF · Model size: 5B params · Architecture: gemma4