
URL: https://huggingface.co/deadbydawn101/gemma-4-E4B-mlx-4bit



Gemma 4 E4B — MLX 4-bit | Tool Calling ✅ | Apple Silicon

The fastest 4B multimodal model on Apple Silicon. Tool calling, TurboQuant 4.6x KV compression, Opus Reasoning LoRA, Ollama ready. 4.86 GB.

Tool Calling ✅ · Built by RavenX AI · Apple Silicon Native



Gemma 4 E4B-it quantized to MLX 4-bit (affine, group_size=64) for Apple Silicon — with the full RavenX AI stack built on top: Opus reasoning fine-tuning, TurboQuant KV cache compression, and Gemini CLI terminal tooling.

4.86 GB. 131K context. Text + vision. Runs on any M-series Mac.


🗂 What's in this stack

| Component | What it does | Link |
|---|---|---|
| This model | Gemma 4 E4B 4-bit MLX, 4.86 GB, 131K ctx | You are here |
| Opus Reasoning LoRA | Adds <think>-tag reasoning, trained on Claude Opus 4.6 traces | ↗ adapter repo |
| Fused version | LoRA baked into weights, no adapter needed | ↗ gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit |
| TurboQuant-MLX | 4.6x KV cache compression, run longer contexts at same RAM | ↗ GitHub |
| Gemini CLI fork | MCP-enabled terminal AI agent with Gemini 3 + 1M ctx | ↗ GitHub |

Model Details

| Property | Value |
|---|---|
| Base model | google/gemma-4-E4B-it |
| Architecture | Gemma4ForConditionalGeneration |
| Parameters | ~4B active |
| Modalities | Text · Vision · Audio |
| Quantization | 4-bit affine, group_size=64 |
| File size | 4.86 GB (down from ~17 GB bf16) |
| Context window | 131,072 tokens |
| Vocab size | 262,144 |
| Hidden size | 2,560 |
| Layers | 42 (35× sliding + 7× full attention) |
| Attention heads | 8 (KV heads: 2) |
| Sliding window | 512 |
| Vision encoder | 768 hidden · 16 layers · patch 16px |
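
To make the "4-bit affine, group_size=64" row concrete, here is a small dependency-free sketch of affine group quantization. It is not the MLX kernel. It only assumes the usual scheme in which each group of 64 weights shares one scale and one bias and every weight is rounded to one of 16 levels. The function names are hypothetical.

```python
import random

BITS = 4
GROUP_SIZE = 64
LEVELS = (1 << BITS) - 1  # 15 steps between the 16 representable values


def quantize_group(weights):
    """Map one group of floats to 4-bit codes plus a per-group scale and bias."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / LEVELS or 1.0  # fall back to 1.0 for a constant group
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate floats from the integer codes."""
    return [c * scale + bias for c in codes]


random.seed(0)
group = [random.gauss(0.0, 0.02) for _ in range(GROUP_SIZE)]
codes, scale, bias = quantize_group(group)
restored = dequantize_group(codes, scale, bias)
max_error = max(abs(a - b) for a, b in zip(group, restored))
print(f"scale={scale:.6f} max_error={max_error:.6f}")
```

The rounding error per weight is bounded by half the group's scale, which is why smaller groups (more scales and biases) trade file size for accuracy.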

⚡ Performance (Apple Silicon)

| Chip | RAM | Tok/sec (est.) |
|---|---|---|
| M4 Max | 128GB | ~55–70 |
| M3 Ultra | 192GB | ~60–80 |
| M3 Pro | 36GB | ~35–50 |
| M2 Pro | 32GB | ~20–30 |
| M1 Air | 16GB | ~12–20 |

Runs entirely in unified memory, so there is no separate GPU VRAM limit. The full model fits in about 6 GB, which leaves roughly 10 GB for context on a 16 GB Mac and more on larger machines.


🚀 Quickstart

Install

pip install mlx-lm mlx-vlm

Text generation

from mlx_lm import load, generate

model, tokenizer = load("deadbydawn101/gemma-4-E4B-mlx-4bit")
messages = [{"role": "user", "content": "Explain quantum entanglement simply."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
response = generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)

Vision (image + text)

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("deadbydawn101/gemma-4-E4B-mlx-4bit")
messages = [{"role": "user", "content": [
 {"type": "image", "image": "https://example.com/photo.jpg"},
 {"type": "text", "text": "Describe this image in detail."}
]}]
prompt = apply_chat_template(processor, model.config, messages, add_generation_prompt=True)
response = generate(model, processor, prompt=prompt, max_tokens=512)

CLI

mlx_lm.generate \
 --model deadbydawn101/gemma-4-E4B-mlx-4bit \
 --prompt "Write a Python function to find all primes below N." \
 --max-tokens 512

OpenAI-compatible server

mlx_lm.server --model deadbydawn101/gemma-4-E4B-mlx-4bit --port 8080

Ollama

ollama run hf.co/deadbydawn101/gemma-4-E4B-mlx-4bit

🧠 Opus Reasoning + Claude Code LoRA

Fine-tune behavior with the Opus Reasoning + Claude Code LoRA — trained on Claude Opus 4.6 reasoning traces and real Claude Code tool-use patterns.

What it teaches the model

| Behavior | Source |
|---|---|
| <think> tag chain-of-thought before every answer | Opus 4.6 reasoning traces |
| Multi-step problem decomposition | Crownelius/Opus-4.6-Reasoning-2100x-formatted |
| Tool call patterns (read/write/bash/search loops) | 140 Claude Code session files |
| Structured completion style | SFT on completions only (not memorization) |

Training results

Dataset: 2,054 train · 109 val · SFT completions-only
Hardware: Apple M4 Max 128GB · Peak mem: 7.876 GB
Runtime: ~6 min for 1,000 iterations @ ~190 tok/sec

Iter 10 → 2.277 ← cold start
Iter 20 → 0.097 ← style locked in fast
Iter 50 → 0.00063
Iter 100 → 0.0000398
Iter 200 → 0.0000067 (checkpoint)
Iter 1000 → ~3.5e-7 (final)

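Loss collapsed early and hard, which suggests the Opus reasoning patterns transferred cleanly to Gemma 4's hybrid attention architecture. A loss this close to zero can also reflect memorization, so the 109-example validation split is the better check of generalization.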

Apply the LoRA

from mlx_vlm import load, generate

model, processor = load(
 "deadbydawn101/gemma-4-E4B-mlx-4bit",
 adapter_path="deadbydawn101/gemma-4-E4B-opus-reasoning-claude-code-lora",
)

Or use the fused model (no adapter needed)

# LoRA baked directly into weights
mlx_lm.generate --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit --prompt "..."

gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit
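
As an illustration of what "baked into weights" means, the sketch below fuses a low-rank update into a weight matrix using plain Python lists. It assumes the common convention that the update is scaled by alpha / rank (2.0 for the rank=8, alpha=16 settings listed under Conversion Details); the scaling actually used by the training tool may differ, and the real fusion also has to deal with the 4-bit quantized base weights, which this toy ignores. The function names and the tiny matrices are hypothetical.

```python
RANK = 8
ALPHA = 16
SCALE = ALPHA / RANK  # assumed convention; 2.0 for the values above


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def fuse(weight, lora_b, lora_a, scale=SCALE):
    """Return weight + scale * (lora_b @ lora_a) as a new matrix."""
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example: a 2x2 weight with a rank-1 adapter (B is 2x1, A is 1x2).
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [0.5]]
lora_a = [[0.1, 0.2]]
fused = fuse(weight, lora_b, lora_a)
print(fused)
```

After fusion, a single matrix multiply gives the same result as running the base weight and the adapter side by side, which is why the fused model needs no adapter_path.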


⚡ TurboQuant-MLX — 4.6x KV Cache Compression

TurboQuant-MLX is a RavenX AI project that compresses the KV cache using PolarQuant (rotation-based quantization) + QJL (1-bit residual correction) — enabling dramatically longer contexts at the same memory budget.

| | Without TurboQuant | With TurboQuant |
|---|---|---|
| Context @ same RAM | 8K | 36K |
| KV cache growth | Linear | Compressed |
| Accuracy impact | | Near-zero |

from turboquant_mlx.mlx_kvcache import TurboQuantKVCache
import mlx_lm.models.cache as cache_module

# Drop-in patch — one line before loading
cache_module.make_prompt_cache = lambda model, **kw: [
 TurboQuantKVCache() for _ in range(len(model.layers))
]

from mlx_vlm import load, generate
model, processor = load("deadbydawn101/gemma-4-E4B-mlx-4bit")
# Context is now compressed automatically — run it as normal

TurboQuant-MLX on GitHub · Release v2.0


💻 Gemini CLI — MCP Terminal Agent

RavenX AI's Gemini CLI fork is an MCP-enabled terminal AI agent — Google Search grounding, file ops, shell commands, web fetching, and 1M token context.

# Install
npm install -g @google/gemini-cli

# Run against local MLX server (self-hosted with this model)
mlx_lm.server --model deadbydawn101/gemma-4-E4B-mlx-4bit --port 8080 &
gemini --baseUrl http://localhost:8080 "Analyze this codebase and suggest improvements"

  • Free tier: 60 req/min · 1,000 req/day
  • MCP support: connect any tool via Model Context Protocol
  • Built-in tools: search grounding, file read/write, shell, web fetch

🏗 Architecture

Gemma 4 uses hybrid sliding/full attention:

  • 35× sliding attention (window=512) — O(n) local context, fast
  • 7× full attention — global coherence at regular intervals

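Because 35 of the 42 layers only ever cache the most recent 512 tokens, KV memory for long sequences grows mainly with the 7 full-attention layers, and those layers preserve full-document coherence. That makes the design a good fit for the TurboQuant long-context use case.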
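
As a rough illustration of how the hybrid layout affects KV cache size, the sketch below counts cached token positions per layer type and converts them to bytes. The layer counts, window size, and KV head count come from the Model Details table. The head dimension of 256 and the 16-bit cache entries are assumptions not stated in this card, and any cross-layer KV sharing is ignored, so treat the output as order-of-magnitude only.

```python
SLIDING_LAYERS = 35
FULL_LAYERS = 7
WINDOW = 512
KV_HEADS = 2
HEAD_DIM = 256       # assumption: not stated in this card
BYTES_PER_VALUE = 2  # assumption: 16-bit cache entries

# Keys and values for one token position in one layer.
BYTES_PER_POSITION = 2 * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def cached_positions(n_tokens):
    """Token positions held across all layers for a sequence of n_tokens."""
    return SLIDING_LAYERS * min(n_tokens, WINDOW) + FULL_LAYERS * n_tokens


def kv_cache_bytes(n_tokens):
    """Estimated KV cache size with the hybrid sliding/full layout."""
    return cached_positions(n_tokens) * BYTES_PER_POSITION


def kv_cache_bytes_all_full(n_tokens):
    """Same estimate if every layer used full attention, for comparison."""
    return (SLIDING_LAYERS + FULL_LAYERS) * n_tokens * BYTES_PER_POSITION


for n in (8_192, 131_072):
    hybrid = kv_cache_bytes(n) / 2**20
    full = kv_cache_bytes_all_full(n) / 2**20
    print(f"{n:>7} tokens: hybrid {hybrid:9.1f} MiB, all-full {full:9.1f} MiB")
```

Under these assumptions the sliding layers contribute a fixed cost once the sequence passes 512 tokens, so the gap between the two estimates widens as context grows.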


💻 Gemini CLI — Coding Agent + Tool Orchestration

We use RavenX AI's Gemini CLI fork as the coding agent and tool orchestration layer on top of these models. This is what makes the tool-calling capability real in production.

Gemini CLI gives you a full agentic loop in the terminal — Google Search grounding, file read/write, shell execution, web fetching, and MCP server support — all wired to a 1M token context window.

# Install
npm install -g @google/gemini-cli

# Run as a coding agent against this model (via local mlx_lm server)
mlx_lm.server --model deadbydawn101/gemma-4-E4B-mlx-4bit --port 8080 &
gemini --baseUrl http://localhost:8080

# Or use directly against Gemini API (free tier: 60 req/min)
gemini

What Gemini CLI + these models unlock together

Capability How
Code generation Gemini CLI reads your codebase, model reasons with <think> tags
Tool calling Native <|tool> tokens → Gemini CLI executes shell/file/web tools
Long context 1M ctx in CLI + TurboQuant 4.6x KV compression = very long sessions
MCP servers Connect any MCP server — databases, APIs, custom tools
Search grounding Google Search built in, so the model gets live data

# Real example: code review with tool calling enabled
gemini --baseUrl http://localhost:8080 \
 "Review all Python files in ./src, find potential bugs, and suggest fixes"

# Gemini CLI will: read files → call tools → model reasons → produce structured output

DeadByDawn101/gemini-cli on GitHub — Apache 2.0, free tier, MCP-compatible

🛠️ Tool Calling (Function Calling)

Gemma 4 has native tool calling built into its chat template. Many models on Hugging Face ship without a tool-aware template; Gemma 4 handles it with the <|tool>, <|tool_call>, and <|tool_response> special tokens.

Define tools and call them

from mlx_lm import load, generate
import json

model, tokenizer = load("deadbydawn101/gemma-4-E4B-mlx-4bit")

tools = [
 {
 "type": "function",
 "function": {
 "name": "get_weather",
 "description": "Get the current weather for a location",
 "parameters": {
 "type": "object",
 "properties": {
 "location": {"type": "string", "description": "City and country"},
 "units": {"type": "string", "enum": ["celsius", "fahrenheit"]}
 },
 "required": ["location"]
 }
 }
 }
]

messages = [{"role": "user", "content": "What's the weather in San Jose, CA?"}]
prompt = tokenizer.apply_chat_template(
 messages,
 tools=tools,
 add_generation_prompt=True,
 tokenize=False
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=256)
# Model responds with a structured tool_call in <|tool_call>...<tool_call|> format

Parse tool calls and feed results back

# After tool execution, feed the result back
messages += [
 {"role": "assistant", "tool_calls": [{"function": {"name": "get_weather", "arguments": {"location": "San Jose, CA"}}}]},
 {"role": "tool", "tool_responses": [{"name": "get_weather", "response": {"temp": 72, "condition": "sunny"}}]}
]
prompt = tokenizer.apply_chat_template(messages, tools=tools, add_generation_prompt=True, tokenize=False)
final = generate(model, tokenizer, prompt=prompt, max_tokens=256)

With mlx_vlm (multimodal + tools)

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("deadbydawn101/gemma-4-E4B-mlx-4bit")
# Reuses the `messages` and `tools` objects defined in the mlx_lm example above
prompt = apply_chat_template(
 processor, model.config, messages,
 tools=tools, add_generation_prompt=True
)

Tool token format (native)

Token Purpose
<|tool>...<tool|> Tool definition block
<|tool_call>call:name{args}<tool_call|> Model calls a tool
<|tool_response>...<tool_response|> Result returned to model
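
The snippets above generate a tool call but do not show how to extract it from the raw text. A minimal parsing sketch follows, based only on the `<|tool_call>call:name{args}<tool_call|>` shape in the table. It assumes the argument block is JSON and falls back to the raw string when it is not; the exact argument encoding is not specified in this card, and `parse_tool_calls` is a hypothetical helper, not part of mlx_lm.

```python
import json
import re

# Matches <|tool_call>call:NAME{...}<tool_call|>, allowing nested braces in the args.
TOOL_CALL_RE = re.compile(
    r"<\|tool_call>call:(?P<name>[\w.-]+)(?P<args>\{.*?\})<tool_call\|>",
    re.DOTALL,
)


def parse_tool_calls(text):
    """Return a list of {'name', 'arguments'} dicts found in generated text."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        raw = match.group("args")
        try:
            arguments = json.loads(raw)
        except json.JSONDecodeError:
            arguments = raw  # keep the raw block if it is not valid JSON
        calls.append({"name": match.group("name"), "arguments": arguments})
    return calls


sample = 'Checking.<|tool_call>call:get_weather{"location": "San Jose, CA"}<tool_call|>'
for call in parse_tool_calls(sample):
    print(call["name"], call["arguments"])
```

Each parsed call can then be dispatched to the matching Python function and its result appended to `messages` as shown in the feed-back example.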

🦙 Ollama — One-Command Setup

Instant run (requires Ollama to be installed; the model is pulled on first run)

ollama run hf.co/deadbydawn101/gemma-4-E4B-mlx-4bit

With a custom system prompt + tool support

Create a Modelfile:

FROM hf.co/deadbydawn101/gemma-4-E4B-mlx-4bit

SYSTEM "You are a helpful assistant with tool-use capabilities. Think through problems step by step."

PARAMETER temperature 0.7
PARAMETER num_ctx 8192

Then build and run the custom model:

ollama create ravenx-gemma4 -f Modelfile
ollama run ravenx-gemma4

OpenAI-compatible endpoint

# Ollama exposes an OpenAI-compatible API automatically
curl http://localhost:11434/v1/chat/completions \
 -H "Content-Type: application/json" \
 -d '{
 "model": "hf.co/deadbydawn101/gemma-4-E4B-mlx-4bit",
 "messages": [{"role": "user", "content": "Hello!"}]
 }'

Run with mlx_lm server (native, faster on Apple Silicon)

# mlx_lm server is faster than Ollama for Apple Silicon — uses Metal GPU directly
mlx_lm.server --model deadbydawn101/gemma-4-E4B-mlx-4bit --port 8080

# Then use any OpenAI client
curl http://localhost:8080/v1/chat/completions \
 -H "Content-Type: application/json" \
 -d '{"model": "deadbydawn101/gemma-4-E4B-mlx-4bit", "messages": [{"role": "user", "content": "Hello!"}]}'
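
The same request can be sent from Python without extra dependencies. This is a sketch using only the standard library, assuming the server started above is listening on port 8080 and returns the usual OpenAI-style `choices[0].message.content` field. `build_chat_request` and `send` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_text, max_tokens=256):
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=120):
    """Send the request and return the assistant text (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request(
    "http://localhost:8080", "deadbydawn101/gemma-4-E4B-mlx-4bit", "Hello!"
)
print(request.full_url)
# With the server running: print(send(request))
```

Pointing `base_url` at http://localhost:11434 targets the Ollama endpoint instead, with the model name changed to match.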

🔧 Conversion Details

| Step | Detail |
|---|---|
| Source | google/gemma-4-E4B-it (bfloat16, ~17 GB) |
| Tool | mlx_vlm.convert --q-bits 4 --q-group-size 64 --q-mode affine |
| Platform | Apple M4 Max 128GB |
| Output | 4.86 GB · ~4.8 bits/weight · 3 shards |
| LoRA training | mlx_vlm.lora SFT · rank=8 · alpha=16 · 1k iters |
| LoRA fusion | mlx_lm fuse, baked into ravenx-opus variant |
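
The "~4.8 bits/weight" figure is higher than the nominal 4 bits because each group also stores a scale and a bias. The arithmetic can be sketched as follows, assuming 16-bit scales and biases per group of 64. The second helper is purely illustrative: it shows what share of weights left at 16-bit would raise the average from 4.5 to 4.8, without claiming that this is how the 4.8 figure actually arises.

```python
def affine_bits_per_weight(bits=4, group_size=64, scale_bits=16, bias_bits=16):
    """Effective storage per weight: the code plus its share of scale and bias."""
    return bits + (scale_bits + bias_bits) / group_size


def high_precision_fraction(average_bits, quantized_bits, full_bits=16):
    """Fraction of weights that would have to stay at full_bits to hit the average."""
    return (average_bits - quantized_bits) / (full_bits - quantized_bits)


quantized = affine_bits_per_weight()
fraction = high_precision_fraction(4.8, quantized)
print(f"{quantized} bits/weight for quantized tensors")
print(f"{fraction:.1%} of weights at 16-bit would give a 4.8-bit average")
```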

📦 Full RavenX Model Collection

| Model | Size | Description |
|---|---|---|
| gemma-4-E4B-mlx-4bit | 4.86 GB | This model: clean 4-bit E4B base |
| gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit | ~4.86 GB | Fused: base + Opus reasoning LoRA baked in |
| gemma-4-E4B-opus-reasoning-claude-code-lora | 658 MB | LoRA adapter only |
| gemma-4-E2B-Heretic-Uncensored-mlx-4bit | 3.34 GB | 2B abliterated (uncensored) |
| gemma-4-21b-REAP-Tool-Calling-mlx-4bit | 12 GB | 21B REAP-pruned MoE |

License

Gemma Terms of Use — free for research and commercial use with attribution.


Built with 🖤 by RavenX AI · TurboQuant-MLX · Gemini CLI

TriAttention KV Compression

[2026-04-09] Our MLX port was merged into TriAttention (MIT + NVIDIA) — PR #1 by @DeadByDawn101 (RavenX AI).

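Apply a 10.7x KV memory reduction and 2.5x throughput gain to this model. Combined with TurboQuant's 4.6x KV cache compression (a separate patch described above, not part of the 4-bit weight quantization), this works out to roughly 50x KV cache compression versus full fp16 (10.7 × 4.6 ≈ 49):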

from mlx_lm import load
from triattention.mlx import apply_triattention_mlx

model, tokenizer = load("deadbydawn101/gemma-4-E4B-mlx-4bit")
apply_triattention_mlx(model, kv_budget=2048)

RavenX Inference Harness

One-command inference, benchmarking, and local OpenAI-compatible server:

git clone https://github.com/DeadByDawn101/ravenx-inference-harness
cd ravenx-inference-harness

# Inference
python run.py --model deadbydawn101/gemma-4-E4B-mlx-4bit --prompt "Your prompt"

# TriAttention compressed
python run.py --model deadbydawn101/gemma-4-E4B-mlx-4bit --triattention --kv-budget 2048

# Local OpenAI-compatible server (works with OpenClaw)
python serve.py --model deadbydawn101/gemma-4-E4B-mlx-4bit --triattention