Gemma 4 E4B — Tool-Calling v0.2
A production-grade tool-calling model built on Google's Gemma 4 E4B-it. Fine-tuned on 78K curated examples to reliably call functions, select the right tool, and know when not to call.
4B parameters · 95% multi-tool accuracy · Runs on a single GPU · Apache 2.0
This model retains the general capabilities of Gemma 4 E4B-it (conversation, reasoning, knowledge); fine-tuning adds tool calling on top of them.
Benchmarks
BFCL Leaderboard (Official)
| Category | Accuracy |
|---|---|
| Multiple (choose correct tool) | 95.0% |
| Parallel (simultaneous calls) | 90.0% |
| Simple Python | 88.5% |
| Parallel Multiple | 86.0% |
| Live Simple | 79.8% |
| Simple JavaScript | 74.0% |
| Live Multiple | 74.2% |
| Live Parallel | 68.8% |
| Non-Live Average | 86.5% |
| Live Average | 74.8% |
v0.1 → v0.2 Improvement
| Metric | v0.1 | v0.2 |
|---|---|---|
| Tool Selection | 64% | 94.4% |
| Full Match (name + args) | 28% | ~90% |
| No-Call Accuracy | 70% | 87.5% |
Quick Start
Ollama (one command)
ollama run hf.co/roshangrewal/gemma4-e4b-toolcall-v02-gguf
vLLM (OpenAI-compatible API)
pip install vllm
vllm serve roshangrewal/gemma4-e4b-toolcall-v02 \
--enable-auto-tool-choice \
--tool-call-parser gemma4 \
--max-model-len 8192 \
--dtype float16
Query the running server with the OpenAI Python client:
from openai import OpenAI

# Point the client at the local vLLM server. vLLM ignores the key unless the server was started with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
    model="roshangrewal/gemma4-e4b-toolcall-v02",
    messages=[{"role": "user", "content": "What's the weather in Mumbai?"}],
    tools=[{"type": "function", "function": {
        "name": "get_weather",
        "description": "Get weather for a city",
        "parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}
    }}]
)
# tool_calls is None when the model answers in plain text instead of calling a tool.
print(response.choices[0].message.tool_calls)
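The `tool_calls` printed by the vLLM example are only requests; the application still has to run them and return the results as `tool` messages. A minimal sketch of that dispatch step follows. The `get_weather` body, the `TOOL_REGISTRY` name, and the `SimpleNamespace` stand-ins for the client's response objects are all assumptions made for illustration, so the snippet runs without a server.

```python
import json
from types import SimpleNamespace


def get_weather(city: str) -> dict:
    # Hypothetical stand-in; a real implementation would query a weather service.
    return {"city": city, "forecast": "unknown"}


# Hypothetical registry mapping tool names to Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls):
    """Execute each requested call and return OpenAI-style 'tool' messages."""
    results = []
    for call in tool_calls or []:  # tool_calls is None for plain-text answers
        func = TOOL_REGISTRY.get(call.function.name)
        if func is None:
            content = {"error": f"unknown tool: {call.function.name}"}
        else:
            try:
                # Arguments arrive as a JSON string in the OpenAI format.
                content = func(**json.loads(call.function.arguments))
            except (json.JSONDecodeError, TypeError) as exc:
                content = {"error": str(exc)}
        results.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(content),
        })
    return results


# Stand-in for response.choices[0].message.tool_calls from the example above.
fake_calls = [SimpleNamespace(
    id="call_0",
    function=SimpleNamespace(name="get_weather", arguments='{"city": "Mumbai"}'),
)]
tool_messages = run_tool_calls(fake_calls)
print(tool_messages)
```

In a real loop, the assistant message and these `tool` messages would be appended to `messages` and sent back to the server so the model can write its final answer.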
HuggingFace Transformers
from transformers import AutoProcessor, AutoModelForMultimodalLM
import torch
model = AutoModelForMultimodalLM.from_pretrained("roshangrewal/gemma4-e4b-toolcall-v02", torch_dtype=torch.bfloat16, device_map="auto")
processor = AutoProcessor.from_pretrained("roshangrewal/gemma4-e4b-toolcall-v02")
tools = [{"type": "function", "function": {
"name": "get_weather",
"description": "Get current weather for a city",
"parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}
}}]
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the weather in Mumbai?"},
]
# Render the chat template together with the tool schemas, then generate greedily.
text = processor.apply_chat_template(messages, tools=tools, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
# Decode only the newly generated tokens; special tokens are kept so the tool-call markup stays visible.
print(processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False))
LM Studio
- Open LM Studio
- Search for roshangrewal/gemma4-e4b-toolcall-v02-gguf
- Download and chat
Examples
| Input | Output |
|---|---|
| "What's the weather in Mumbai?" | call:get_weather{city:"Mumbai"} |
| "Find flights from Delhi to NYC on Jan 15" | call:search_flights{origin:"Delhi",destination:"NYC",date:"2025-01-15"} |
| "What is Apple stock price?" | call:get_stock_price{symbol:"AAPL"} |
| "Transfer 5000 from ACC001 to ACC002" | call:transfer_money{from_account:"ACC001",to_account:"ACC002",amount:5000} |
| "Hi, how are you?" | "I'm doing great, thank you!" (no tool call) |
| "What is the capital of France?" | "Paris." (no tool call) |
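When the model is run without a server-side parser (for example through Transformers), the caller has to extract the call from the generated text. The sketch below parses only the simplified `call:name{key:value,...}` form shown in the table above. It assumes flat arguments, double-quoted strings, and no `}` inside values; raw model output may wrap calls in additional special tokens that this sketch does not handle. The name `parse_calls` is hypothetical.

```python
import json
import re

# Matches call:<name>{<body>} as written in the examples table.
CALL_RE = re.compile(r'call:(\w+)\{(.*?)\}', re.S)
# Matches key:"quoted string" or key:bare_value inside the braces.
ARG_RE = re.compile(r'(\w+):("(?:[^"\\]|\\.)*"|[^,}]+)')


def parse_calls(text: str) -> list:
    """Return [{'name': ..., 'arguments': {...}}, ...] for each call found in text."""
    calls = []
    for name, body in CALL_RE.findall(text):
        arguments = {}
        for key, raw in ARG_RE.findall(body):
            raw = raw.strip()
            try:
                # Quoted strings, numbers, true/false/null decode as JSON.
                arguments[key] = json.loads(raw)
            except ValueError:
                # Anything else is kept as a plain string.
                arguments[key] = raw
        calls.append({"name": name, "arguments": arguments})
    return calls


print(parse_calls('call:get_weather{city:"Mumbai"}'))
print(parse_calls('call:transfer_money{from_account:"ACC001",to_account:"ACC002",amount:5000}'))
print(parse_calls("I'm doing great, thank you!"))  # no tool call, so an empty list
```

An empty list is the signal to treat the output as an ordinary chat reply.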
Available Formats
| Format | Repo | Size | Best for |
|---|---|---|---|
| Full model (fp16) | This repo | ~16 GB | vLLM, transformers |
| LoRA adapter | gemma4-e4b-toolcall-v02-lora | ~680 MB | Further fine-tuning |
| GGUF Q8_0 | gemma4-e4b-toolcall-v02-gguf | ~8 GB | Ollama, llama.cpp, LM Studio |
Training
| Parameter | Value |
|---|---|
| Base model | google/gemma-4-E4B-it |
| Method | QLoRA (4-bit NF4) via Unsloth |
| LoRA rank | 64 (alpha=128) |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Training data | 78,298 examples |
| Max sequence length | 4,096 tokens (model supports 32K at inference) |
| Batch size | 16 effective (1 per step x 16 gradient accumulation steps) |
| Learning rate | 2e-4 (cosine decay) |
| Warmup steps | 250 |
| Steps | 5,000 |
| Training time | ~85 hours |
| Hardware | 1x NVIDIA A100 80GB GPU |
| Optimizer | AdamW 8-bit |
| NEFTune noise | alpha=5.0 |
| Packing | Enabled (multiple examples per sequence) |
| Gradient checkpointing | Enabled |
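The table gives the peak learning rate, the decay type, the warmup length, and the step count. The sketch below turns those numbers into a schedule and a few derived quantities. The table does not state the shape of the warmup or the final learning rate; linear warmup and decay to zero are assumptions here, chosen because they are a common default for cosine schedulers. The token count is an upper bound that assumes every packed sequence is full.

```python
import math

# Values taken from the training table.
PEAK_LR = 2e-4
WARMUP_STEPS = 250
TOTAL_STEPS = 5000
EFFECTIVE_BATCH = 16
MAX_SEQ_LEN = 4096
TRAINING_HOURS = 85  # approximate, per the table


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step.

    Assumption: linear warmup from 0, then cosine decay to 0.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


# Derived quantities.
sequences_seen = EFFECTIVE_BATCH * TOTAL_STEPS        # packed sequences processed
max_tokens_seen = sequences_seen * MAX_SEQ_LEN        # upper bound with full packing
seconds_per_step = TRAINING_HOURS * 3600 / TOTAL_STEPS

for step in (0, 125, 250, 2625, 5000):
    print(step, f"{lr_at(step):.2e}")
print(sequences_seen, max_tokens_seen, round(seconds_per_step, 1))
```

Under these assumptions the run processes 80,000 packed sequences (at most about 328M tokens) at roughly 61 seconds per optimizer step.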
Data Sources
| Source | Examples | Content |
|---|---|---|
| NVIDIA Nemotron-SFT-Agentic-v2 (interactive_agent) | 40,000 | Multi-turn customer service with tool calls |
| NVIDIA Nemotron-SFT-Agentic-v2 (tool_calling) | 817 | Single/multi-step tool chains |
| Glaive function-calling-v2 | 2,378 | Simple function calling pairs |
| Existing data (filtered & standardized) | 2,143 | Mixed tool-call conversations |
| Total (after splitting long examples) | 78,298 | |
All data is standardized to the OpenAI format (messages + tools). 80% of examples are multi-turn conversations (average 11.3 messages per example).
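For readers unfamiliar with the OpenAI format, here is a sketch of what one record with `messages` and `tools` might look like. The record is invented for illustration and is not taken from the training set. The field layout follows the public OpenAI chat format, and `undeclared_tool_calls` is a hypothetical sanity check, not part of the author's pipeline.

```python
import json

# Hypothetical record, invented for illustration only.
record = {
    "tools": [{"type": "function", "function": {
        "name": "get_weather",
        "description": "Get weather for a city",
        "parameters": {"type": "object",
                       "properties": {"city": {"type": "string"}},
                       "required": ["city"]},
    }}],
    "messages": [
        {"role": "user", "content": "What's the weather in Mumbai?"},
        {"role": "assistant", "content": None, "tool_calls": [{
            "id": "call_0", "type": "function",
            "function": {"name": "get_weather",
                         "arguments": json.dumps({"city": "Mumbai"})},
        }]},
        {"role": "tool", "tool_call_id": "call_0",
         "content": json.dumps({"temp_c": 31})},  # made-up tool result
        {"role": "assistant", "content": "It is 31 degrees Celsius in Mumbai."},
    ],
}


def undeclared_tool_calls(rec: dict) -> list:
    """Return names of called tools that are missing from the record's tool list."""
    declared = {tool["function"]["name"] for tool in rec.get("tools", [])}
    called = [
        call["function"]["name"]
        for msg in rec.get("messages", [])
        for call in (msg.get("tool_calls") or [])
    ]
    return [name for name in called if name not in declared]


print(len(record["messages"]), undeclared_tool_calls(record))
print(json.dumps(record)[:60])  # the record serializes to a single JSONL line
```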
Limitations
- Outputs tool calls in Gemma 4's native format (vLLM converts them to JSON automatically)
- Most accurate on sequences up to 4K tokens, the training length; longer inputs up to 32K still work
- 87.5% accuracy on knowing when NOT to call, so it may occasionally call a tool when none is needed
- Not tested on non-English queries
- As a 4B model, it is less capable than 70B+ models on complex multi-step reasoning
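Because the model can occasionally emit a call that should not run, it may be worth checking each call against the declared tool schema before executing it. The following is a minimal sketch under that assumption. `validate_call` is a hypothetical helper that checks only the tool name, required arguments, and unexpected arguments, not value types.

```python
def validate_call(name: str, arguments: dict, tools: list) -> list:
    """Return a list of problems; an empty list means the call passes the checks."""
    spec = next((t["function"] for t in tools if t["function"]["name"] == name), None)
    if spec is None:
        return [f"unknown tool: {name}"]
    params = spec.get("parameters", {})
    allowed = params.get("properties", {})
    problems = [f"missing required argument: {key}"
                for key in params.get("required", []) if key not in arguments]
    problems += [f"unexpected argument: {key}"
                 for key in arguments if key not in allowed]
    return problems


# Same tool schema as in the Quick Start examples.
tools = [{"type": "function", "function": {
    "name": "get_weather",
    "description": "Get weather for a city",
    "parameters": {"type": "object",
                   "properties": {"city": {"type": "string"}},
                   "required": ["city"]},
}}]

print(validate_call("get_weather", {"city": "Mumbai"}, tools))        # []
print(validate_call("get_weather", {}, tools))                        # ['missing required argument: city']
print(validate_call("get_stock_price", {"symbol": "AAPL"}, tools))    # ['unknown tool: get_stock_price']
```

A call that fails validation can be dropped or returned to the model as an error message instead of being executed.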
Citation
@misc{grewal2026gemma4toolcall,
title={Gemma 4 E4B Tool-Calling v0.2},
author={Roshan Grewal},
year={2026},
publisher={HuggingFace},
url={https://huggingface.co/roshangrewal/gemma4-e4b-toolcall-v02}
}