Gemma 4 E4B — MLX 4-bit | Tool Calling ✅ | Apple Silicon

A 4B-class multimodal model tuned for speed on Apple Silicon. Tool calling, TurboQuant 4.6x KV compression, Opus Reasoning LoRA, Ollama ready. 4.86 GB.

Tool Calling ✅ · Built by RavenX AI · Apple Silicon Native

Gemma 4 E4B-it quantized to MLX 4-bit (affine, group_size=64) for Apple Silicon — with the full RavenX AI stack built on top: Opus reasoning fine-tuning, TurboQuant KV cache compression, and Gemini CLI terminal tooling.

4.86 GB. 131K context. Text + vision. Runs on any M-series Mac.

🗂 What's in this stack

Component	What it does	Link
This model	Gemma 4 E4B 4-bit MLX — 4.86 GB, 131K ctx	You are here
Opus Reasoning LoRA	Adds `<think>`-tag reasoning, trained on Claude Opus 4.6 traces	↗ adapter repo
Fused version	LoRA baked into weights — no adapter needed	↗ gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit
TurboQuant-MLX	4.6x KV cache compression — run longer contexts at same RAM	↗ GitHub
Gemini CLI fork	MCP-enabled terminal AI agent with Gemini 3 + 1M ctx	↗ GitHub

Model Details

Property	Value
Base model	google/gemma-4-E4B-it
Architecture	Gemma4ForConditionalGeneration
Parameters	~4B active
Modalities	Text · Vision · Audio
Quantization	4-bit affine, group_size=64
File size	4.86 GB (down from ~17 GB bf16)
Context window	131,072 tokens
Vocab size	262,144
Hidden size	2,560
Layers	42 (35× sliding + 7× full attention)
Attention heads	8 (KV heads: 2)
Sliding window	512
Vision encoder	768 hidden · 16 layers · patch 16px

⚡ Performance (Apple Silicon)

Chip	RAM	Tok/sec (est)
M4 Max	128GB	~55–70
M3 Ultra	192GB	~60–80
M3 Pro	36GB	~35–50
M2 Pro	32GB	~20–30
M1 Air	16GB	~12–20

Weights and KV cache share unified memory, so there is no separate GPU VRAM limit. The full model fits in ~6 GB, which on a 16 GB Mac leaves roughly 10 GB for context, the OS, and other apps.

🚀 Quickstart

Install

pip install mlx-lm mlx-vlm

Text generation

from mlx_lm import load, generate

model, tokenizer = load("deadbydawn101/gemma-4-E4B-mlx-4bit")
messages = [{"role": "user", "content": "Explain quantum entanglement simply."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
response = generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)

Vision (image + text)

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("deadbydawn101/gemma-4-E4B-mlx-4bit")
messages = [{"role": "user", "content": [
 {"type": "image", "image": "https://example.com/photo.jpg"},
 {"type": "text", "text": "Describe this image in detail."}
]}]
prompt = apply_chat_template(processor, model.config, messages, add_generation_prompt=True)
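# Pass the image to generate() as well, so the pixels actually reach the model
response = generate(model, processor, prompt=prompt, image=["https://example.com/photo.jpg"], max_tokens=512)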

CLI

mlx_lm.generate \
 --model deadbydawn101/gemma-4-E4B-mlx-4bit \
 --prompt "Write a Python function to find all primes below N." \
 --max-tokens 512

OpenAI-compatible server

mlx_lm.server --model deadbydawn101/gemma-4-E4B-mlx-4bit --port 8080
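
Once the server is running, any HTTP client can talk to it. The sketch below uses only the Python standard library and assumes the server follows the usual OpenAI chat-completions request and response shape at `/v1/chat/completions`. The helper names `build_chat_request` and `chat` are illustrative and not part of mlx_lm.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_text, max_tokens=256):
    """Build a POST request for an OpenAI-style chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, model, user_text, max_tokens=256, timeout=120):
    """Send one user message and return the assistant's reply text.

    Assumes the response carries the text at choices[0].message.content.
    """
    request = build_chat_request(base_url, model, user_text, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (requires the server above to be running on port 8080):
# print(chat("http://localhost:8080", "deadbydawn101/gemma-4-E4B-mlx-4bit", "Hello!"))
```

The same helper should work against any other OpenAI-compatible endpoint by changing `base_url` and `model`.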

Ollama

ollama run hf.co/deadbydawn101/gemma-4-E4B-mlx-4bit

🧠 Opus Reasoning + Claude Code LoRA

Fine-tune behavior with the Opus Reasoning + Claude Code LoRA — trained on Claude Opus 4.6 reasoning traces and real Claude Code tool-use patterns.

What it teaches the model

Behavior	Source
`<think>` tag chain-of-thought before every answer	Opus 4.6 reasoning traces
Multi-step problem decomposition	Crownelius/Opus-4.6-Reasoning-2100x-formatted
Tool call patterns (read/write/bash/search loops)	140 Claude Code session files
Structured completion style	SFT on completions only (not memorization)
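
Because the adapter puts its chain-of-thought inside `<think>` tags, callers will often want to separate the reasoning from the final answer. The sketch below assumes the block is closed with `</think>`, which the table above does not state explicitly. `split_reasoning` is a hypothetical helper, not part of mlx_lm or mlx_vlm.

```python
import re

# Assumption: reasoning is wrapped as <think> ... </think>
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer); reasoning is '' when no <think> block exists."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the first <think> block is treated as the answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


reasoning, answer = split_reasoning("<think>2 + 2 is 4</think>\nThe answer is 4.")
print(reasoning)  # 2 + 2 is 4
print(answer)     # The answer is 4.
```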

Training results

Dataset: 2,054 train · 109 val · SFT completions-only
Hardware: Apple M4 Max 128GB · Peak mem: 7.876 GB
Runtime: ~6 min for 1,000 iterations @ ~190 tok/sec

Iter 10 → 2.277 ← cold start
Iter 20 → 0.097 ← style locked in fast
Iter 50 → 0.00063
Iter 100 → 0.0000398
Iter 200 → 0.0000067 (checkpoint)
Iter 1000 → ~3.5e-7 (final)

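Training loss collapsed early, falling below 1e-4 by iteration 100 and reaching about 3.5e-7 by iteration 1,000, which suggests the Opus reasoning format was easy for Gemma 4's hybrid attention architecture to fit. These are training-loss figures; validation loss on the 109-example val split is not listed here.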

Apply the LoRA

from mlx_vlm import load, generate

model, processor = load(
 "deadbydawn101/gemma-4-E4B-mlx-4bit",
 adapter_path="deadbydawn101/gemma-4-E4B-opus-reasoning-claude-code-lora",
)

Or use the fused model (no adapter needed)

# LoRA baked directly into weights
mlx_lm.generate --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit --prompt "..."

→ gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit

⚡ TurboQuant-MLX — 4.6x KV Cache Compression

TurboQuant-MLX is a RavenX AI project that compresses the KV cache using PolarQuant (rotation-based quantization) + QJL (1-bit residual correction) — enabling dramatically longer contexts at the same memory budget.

	Without TurboQuant	With TurboQuant
Context @ same RAM	8K	36K
KV cache growth	Linear	Compressed
Accuracy impact	—	Near-zero
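
The 8K to 36K row follows from the compression ratio: at a fixed KV memory budget, the number of cached tokens scales with the ratio. A minimal sketch of that arithmetic follows. The per-token byte count is a made-up placeholder chosen so the baseline is exactly 8K tokens, not a measured value for this model.

```python
def max_context_tokens(kv_budget_bytes, bytes_per_token, compression=1.0):
    """Tokens whose KV entries fit in the budget at a given compression ratio."""
    return int(kv_budget_bytes * compression / bytes_per_token)


BYTES_PER_TOKEN = 1024            # hypothetical, for illustration only
BUDGET = 8192 * BYTES_PER_TOKEN   # budget that holds exactly 8K tokens uncompressed

baseline = max_context_tokens(BUDGET, BYTES_PER_TOKEN)
compressed = max_context_tokens(BUDGET, BYTES_PER_TOKEN, compression=4.6)
print(baseline, compressed)  # 8192 37683, i.e. roughly 8K vs 36.8K tokens
```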

from turboquant_mlx.mlx_kvcache import TurboQuantKVCache
import mlx_lm.models.cache as cache_module

# Drop-in patch: replace the prompt-cache factory before loading the model
cache_module.make_prompt_cache = lambda model, **kw: [
 TurboQuantKVCache() for _ in range(len(model.layers))
]

from mlx_vlm import load, generate
model, processor = load("deadbydawn101/gemma-4-E4B-mlx-4bit")
# Context is now compressed automatically — run it as normal

→ TurboQuant-MLX on GitHub · Release v2.0

💻 Gemini CLI — MCP Terminal Agent

RavenX AI's Gemini CLI fork is an MCP-enabled terminal AI agent — Google Search grounding, file ops, shell commands, web fetching, and 1M token context.

# Install
npm install -g @google/gemini-cli

# Run against local MLX server (self-hosted with this model)
mlx_lm.server --model deadbydawn101/gemma-4-E4B-mlx-4bit --port 8080 &
gemini --baseUrl http://localhost:8080 "Analyze this codebase and suggest improvements"

Free tier (hosted Gemini API): 60 req/min · 1,000 req/day
MCP support: connect any tool via Model Context Protocol
Built-in tools: search grounding, file read/write, shell, web fetch

🏗 Architecture

Gemma 4 uses hybrid sliding/full attention:

35× sliding attention (window=512) — O(n) local context, fast
7× full attention — global coherence at regular intervals

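Only the 7 full-attention layers keep keys and values for the whole sequence; the 35 sliding layers cap their cache at 512 tokens. For long sequences, KV memory therefore grows about six times more slowly than in an all-full-attention stack of the same depth (7 of 42 layers), while the full-attention layers preserve document-level coherence. This suits the TurboQuant long-context use case.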
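
The effect of the 35/7 split on KV cache size can be estimated with a few lines of arithmetic. The sketch below uses the layer counts, window, and KV head count from the Model Details table. The head dimension of 256 and the 2-byte (fp16) cache entries are assumptions, so the absolute numbers are illustrative and only the ratio between the two cases is the point.

```python
def kv_cache_bytes(seq_len, n_full=7, n_sliding=35, window=512,
                   kv_heads=2, head_dim=256, bytes_per_value=2):
    """Estimate uncompressed KV cache size in bytes for a hybrid attention stack.

    head_dim=256 and bytes_per_value=2 are assumptions, not values from the card.
    """
    # Keys and values: 2 tensors per layer, each kv_heads * head_dim wide.
    per_token = 2 * kv_heads * head_dim * bytes_per_value
    full = n_full * seq_len * per_token
    # Sliding layers only keep the most recent `window` tokens.
    sliding = n_sliding * min(seq_len, window) * per_token
    return full + sliding


hybrid = kv_cache_bytes(131_072)
all_full = kv_cache_bytes(131_072, n_full=42, n_sliding=0)
print(f"hybrid:   {hybrid / 2**30:.2f} GiB")    # hybrid:   1.78 GiB
print(f"all-full: {all_full / 2**30:.2f} GiB")  # all-full: 10.50 GiB
```

Under these assumptions the hybrid layout needs roughly one sixth of the cache that 42 full-attention layers would need at the full 131,072-token window.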

💻 Gemini CLI — Coding Agent + Tool Orchestration

We use RavenX AI's Gemini CLI fork as the coding agent and tool orchestration layer on top of these models. This is what makes the tool-calling capability real in production.

Gemini CLI gives you a full agentic loop in the terminal — Google Search grounding, file read/write, shell execution, web fetching, and MCP server support — all wired to a 1M token context window.

# Install
npm install -g @google/gemini-cli

# Run as a coding agent against this model (via local mlx_lm server)
mlx_lm.server --model deadbydawn101/gemma-4-E4B-mlx-4bit --port 8080 &
gemini --baseUrl http://localhost:8080

# Or use directly against Gemini API (free tier: 60 req/min)
gemini

What Gemini CLI + these models unlock together

Capability	How
Code generation	Gemini CLI reads your codebase, model reasons with `<think>` tags
Tool calling	Native `<|tool>` tokens → Gemini CLI executes shell/file/web tools
Long context	1M ctx with hosted Gemini models; 131K ctx with this model, where TurboQuant 4.6x KV compression lets more of it fit in RAM
MCP servers	Connect any MCP server — databases, APIs, custom tools
Search grounding	Google Search built in — model gets live data

# Real example: code review with tool calling enabled
gemini --baseUrl http://localhost:8080 \
 "Review all Python files in ./src, find potential bugs, and suggest fixes"

# Gemini CLI will: read files → call tools → model reasons → produce structured output

→ DeadByDawn101/gemini-cli on GitHub — Apache 2.0, free tier, MCP-compatible

🛠️ Tool Calling (Function Calling)

Gemma 4 has native tool calling built into its chat template, using the <|tool>, <|tool_call>, and <|tool_response> special tokens. Models whose chat template has no tool support need prompt-level workarounds instead; here the `tools` argument of `apply_chat_template` is enough.

Define tools and call them

from mlx_lm import load, generate
import json

model, tokenizer = load("deadbydawn101/gemma-4-E4B-mlx-4bit")

tools = [
 {
 "type": "function",
 "function": {
 "name": "get_weather",
 "description": "Get the current weather for a location",
 "parameters": {
 "type": "object",
 "properties": {
 "location": {"type": "string", "description": "City and country"},
 "units": {"type": "string", "enum": ["celsius", "fahrenheit"]}
 },
 "required": ["location"]
 }
 }
 }
]

messages = [{"role": "user", "content": "What's the weather in San Jose, CA?"}]
prompt = tokenizer.apply_chat_template(
 messages,
 tools=tools,
 add_generation_prompt=True,
 tokenize=False
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=256)
# Model responds with a structured tool_call in <|tool_call>...<tool_call|> format

Parse tool calls and feed results back

# After tool execution, feed the result back
messages += [
 {"role": "assistant", "tool_calls": [{"function": {"name": "get_weather", "arguments": {"location": "San Jose, CA"}}}]},
 {"role": "tool", "tool_responses": [{"name": "get_weather", "response": {"temp": 72, "condition": "sunny"}}]}
]
prompt = tokenizer.apply_chat_template(messages, tools=tools, add_generation_prompt=True, tokenize=False)
final = generate(model, tokenizer, prompt=prompt, max_tokens=256)
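
The snippet above hardcodes the tool call. In practice it has to be extracted from the model's output first. The following is a minimal sketch, assuming the output follows the `<|tool_call>call:name{args}<tool_call|>` shape shown in the token table and that the arguments are JSON; non-JSON arguments are kept as raw text rather than guessed at. `parse_tool_calls`, `run_tool_calls`, and the `registry` mapping are hypothetical helpers, not part of mlx_lm.

```python
import json
import re

# Assumed output shape: <|tool_call>call:NAME{ARGS}<tool_call|>
TOOL_CALL_RE = re.compile(
    r"<\|tool_call>call:(?P<name>[\w.-]+)(?P<args>\{.*?\})<tool_call\|>",
    re.DOTALL,
)


def parse_tool_calls(text):
    """Extract tool calls as [{'name': ..., 'arguments': ...}, ...]."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        raw_args = match.group("args")
        try:
            arguments = json.loads(raw_args)
        except json.JSONDecodeError:
            arguments = raw_args  # not JSON: keep the raw text
        calls.append({"name": match.group("name"), "arguments": arguments})
    return calls


def run_tool_calls(calls, registry):
    """Run each call via `registry` and return the two messages to append."""
    responses = []
    for call in calls:
        func = registry.get(call["name"])
        if func is None or not isinstance(call["arguments"], dict):
            result = {"error": "unknown tool or unparseable arguments"}
        else:
            result = func(**call["arguments"])
        responses.append({"name": call["name"], "response": result})
    return [
        {"role": "assistant", "tool_calls": [{"function": c} for c in calls]},
        {"role": "tool", "tool_responses": responses},
    ]


# Usage with the example above (get_weather_impl is your own function):
# messages += run_tool_calls(parse_tool_calls(response), {"get_weather": get_weather_impl})
```

The returned messages use the same `tool_calls` / `tool_responses` layout as the hardcoded example, so they can be passed straight back through `apply_chat_template`.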

With mlx_vlm (multimodal + tools)

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("deadbydawn101/gemma-4-E4B-mlx-4bit")
# `messages` and `tools` are the objects defined in the mlx_lm example above
prompt = apply_chat_template(
 processor, model.config, messages,
 tools=tools, add_generation_prompt=True
)

Tool token format (native)

Token	Purpose
`<|tool>...<tool|>`	Tool definition block
`<|tool_call>call:name{args}<tool_call|>`	Model calls a tool
`<|tool_response>...<tool_response|>`	Result returned to model

🦙 Ollama — One-Command Setup

Instant run (Ollama pulls the model on first use)

ollama run hf.co/deadbydawn101/gemma-4-E4B-mlx-4bit

With a custom system prompt + tool support

Create a Modelfile:

FROM hf.co/deadbydawn101/gemma-4-E4B-mlx-4bit

SYSTEM "You are a helpful assistant with tool-use capabilities. Think through problems step by step."

PARAMETER temperature 0.7
PARAMETER num_ctx 8192

ollama create ravenx-gemma4 -f Modelfile
ollama run ravenx-gemma4

OpenAI-compatible endpoint

# Ollama exposes an OpenAI-compatible API automatically
curl http://localhost:11434/v1/chat/completions \
 -H "Content-Type: application/json" \
 -d '{
 "model": "hf.co/deadbydawn101/gemma-4-E4B-mlx-4bit",
 "messages": [{"role": "user", "content": "Hello!"}]
 }'

Run with mlx_lm server (native MLX on Apple Silicon)

# mlx_lm.server runs these MLX weights directly on the Metal GPU
mlx_lm.server --model deadbydawn101/gemma-4-E4B-mlx-4bit --port 8080

# Then use any OpenAI client
curl http://localhost:8080/v1/chat/completions \
 -H "Content-Type: application/json" \
 -d '{"model": "deadbydawn101/gemma-4-E4B-mlx-4bit", "messages": [{"role": "user", "content": "Hello!"}]}'

🔧 Conversion Details

Step	Detail
Source	`google/gemma-4-E4B-it` (bfloat16, ~17 GB)
Tool	`mlx_vlm.convert --q-bits 4 --q-group-size 64 --q-mode affine`
Platform	Apple M4 Max 128GB
Output	4.86 GB · ~4.8 bits/weight · 3 shards
LoRA training	`mlx_vlm.lora` SFT · rank=8 · alpha=16 · 1k iters
LoRA fusion	`mlx_lm fuse`, baked into the fused gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit variant
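
The bits-per-weight figure can be sanity-checked by hand. In MLX affine quantization each group of weights stores a scale and a bias next to the packed values, so a 4-bit, group_size=64 layer costs slightly more than 4 bits per weight. The sketch below assumes 16-bit scales and biases. The gap up to the reported ~4.8 average presumably comes from tensors kept at higher precision, which is an assumption rather than something the table states.

```python
def affine_bits_per_weight(bits=4, group_size=64, scale_bits=16, bias_bits=16):
    """Effective bits per weight for one affine-quantized layer.

    Assumes one scale and one bias per group, each stored in 16 bits.
    """
    return bits + (scale_bits + bias_bits) / group_size


def implied_weight_count(size_bytes, bits_per_weight):
    """Number of stored weights implied by a file size and an average bit width."""
    return size_bytes * 8 / bits_per_weight


per_layer = affine_bits_per_weight()
implied = implied_weight_count(4.86e9, 4.8)
print(per_layer)                 # 4.5
print(f"{implied / 1e9:.1f}B")   # 8.1B
```

The second figure is just 4.86 GB divided by ~4.8 bits/weight. It counts every stored weight, which is why it is larger than the ~4B active parameters listed under Model Details.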

📦 Full RavenX Model Collection

Model	Size	Description
gemma-4-E4B-mlx-4bit	4.86 GB	This model — clean 4-bit E4B base
gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit	~4.86 GB	Fused: base + Opus reasoning LoRA baked in
gemma-4-E4B-opus-reasoning-claude-code-lora	658 MB	LoRA adapter only
gemma-4-E2B-Heretic-Uncensored-mlx-4bit	3.34 GB	2B abliterated (uncensored)
gemma-4-21b-REAP-Tool-Calling-mlx-4bit	12 GB	21B REAP-pruned MoE

License

Gemma Terms of Use — free for research and commercial use with attribution.

Built with 🖤 by RavenX AI · TurboQuant-MLX · Gemini CLI

TriAttention KV Compression

[2026-04-09] Our MLX port was merged into TriAttention (MIT + NVIDIA) — PR #1 by @DeadByDawn101 (RavenX AI).

Apply TriAttention's 10.7x KV memory reduction and 2.5x throughput on top of TurboQuant's 4.6x KV cache compression for ~50x combined KV cache compression vs full fp16 (10.7 × 4.6 ≈ 49):

from mlx_lm import load
from triattention.mlx import apply_triattention_mlx

model, tokenizer = load("deadbydawn101/gemma-4-E4B-mlx-4bit")
apply_triattention_mlx(model, kv_budget=2048)

RavenX Inference Harness

One-command inference, benchmarking, and local OpenAI-compatible server:

git clone https://github.com/DeadByDawn101/ravenx-inference-harness
cd ravenx-inference-harness

# Inference
python run.py --model deadbydawn101/gemma-4-E4B-mlx-4bit --prompt "Your prompt"

# TriAttention compressed
python run.py --model deadbydawn101/gemma-4-E4B-mlx-4bit --triattention --kv-budget 2048

# Local OpenAI-compatible server (works with OpenClaw)
python serve.py --model deadbydawn101/gemma-4-E4B-mlx-4bit --triattention

URL: https://huggingface.co/deadbydawn101/gemma-4-E4B-mlx-4bit
