Gemma 4 E4B — Tool-Calling v0.2

A production-grade tool-calling model built on Google's Gemma 4 E4B-it. Fine-tuned on 78K curated examples to call functions reliably, select the right tool, and recognize when no call is needed.

4B effective parameters (8B total in the checkpoint) · 95% multi-tool accuracy · Runs on a single GPU · Apache 2.0

This model retains the general capabilities of Gemma 4 E4B-it (conversation, reasoning, knowledge). Tool calling was added on top through fine-tuning.

Benchmarks

[Figure: Comparison with other models]

[Figure: BFCL Results Breakdown]

BFCL Leaderboard (Official)

Category	Accuracy
Multiple (choose correct tool)	95.0%
Parallel (simultaneous calls)	90.0%
Simple Python	88.5%
Parallel Multiple	86.0%
Live Simple	79.8%
Simple JavaScript	74.0%
Live Multiple	74.2%
Live Parallel	68.8%
Non-Live Average	86.5%
Live Average	74.8%

BFCL Submission PR →

v0.1 → v0.2 Improvement

[Figure: v0.1 vs v0.2]

Metric	v0.1	v0.2
Tool Selection	64%	94.4%
Full Match (name + args)	28%	~90%
No-Call Accuracy	70%	87.5%
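
The table does not spell out how these metrics are scored. One plausible reading, shown here as a sketch: tool selection counts a prediction as correct when the function name matches, full match additionally requires identical arguments, and no-call accuracy is measured only over examples where no call is expected. The `score` function and the record layout are hypothetical and are not the author's evaluation harness.

```python
from typing import Optional

# A call is (name, args); None means "no tool call expected/predicted".
Call = Optional[tuple[str, dict]]


def score(pairs: list[tuple[Call, Call]]) -> dict[str, float]:
    """Score (reference, prediction) pairs under the assumed metric definitions."""
    with_call = [(ref, pred) for ref, pred in pairs if ref is not None]
    without_call = [(ref, pred) for ref, pred in pairs if ref is None]

    # Tool selection: the predicted function name matches the reference name.
    name_hits = sum(1 for ref, pred in with_call if pred is not None and pred[0] == ref[0])
    # Full match: the name and the complete argument dict both match.
    full_hits = sum(1 for ref, pred in with_call if pred == ref)
    # No-call: the model makes no call when none is expected.
    no_call_hits = sum(1 for _, pred in without_call if pred is None)

    return {
        "tool_selection": name_hits / len(with_call) if with_call else 0.0,
        "full_match": full_hits / len(with_call) if with_call else 0.0,
        "no_call": no_call_hits / len(without_call) if without_call else 0.0,
    }


pairs = [
    (("get_weather", {"city": "Mumbai"}), ("get_weather", {"city": "Mumbai"})),  # full match
    (("get_weather", {"city": "Mumbai"}), ("get_weather", {"city": "Delhi"})),   # name only
    (None, None),                                                                 # correct no-call
    (None, ("get_weather", {"city": "Paris"})),                                   # over-trigger
]
print(score(pairs))
```

Under these definitions, full match can never exceed tool selection, which is consistent with the ordering of the two rows above.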

Quick Start

Ollama (one command)

ollama run hf.co/roshangrewal/gemma4-e4b-toolcall-v02-gguf

vLLM (OpenAI-compatible API)

pip install vllm

vllm serve roshangrewal/gemma4-e4b-toolcall-v02 \
 --enable-auto-tool-choice \
 --tool-call-parser gemma4 \
 --max-model-len 8192 \
 --dtype float16

from openai import OpenAI

# vLLM does not check the API key by default, so any placeholder string works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="roshangrewal/gemma4-e4b-toolcall-v02",
    messages=[{"role": "user", "content": "What's the weather in Mumbai?"}],
    # One tool, described with a JSON Schema for its arguments.
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
)

# A list of tool calls, or None if the model answered in plain text.
print(response.choices[0].message.tool_calls)
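
Once the server returns `tool_calls`, the application still has to run the named function itself. The following is a minimal sketch of that step. The `get_weather` implementation, the `REGISTRY` dict, and the `dispatch` helper are hypothetical, and a stand-in object with the same attributes as an OpenAI client tool call (`.function.name`, and `.function.arguments` as a JSON string) is used so the block runs without a server.

```python
import json
from types import SimpleNamespace


def get_weather(city: str) -> dict:
    """Hypothetical local implementation; a real one would query a weather service."""
    return {"city": city, "forecast": "unknown"}


# Registry mapping tool names (as declared in `tools`) to Python callables.
REGISTRY = {"get_weather": get_weather}


def dispatch(tool_call) -> str:
    """Run one tool call and return its result as a JSON string."""
    name = tool_call.function.name
    if name not in REGISTRY:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        args = json.loads(tool_call.function.arguments)
        return json.dumps(REGISTRY[name](**args))
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON, or arguments that do not fit the function signature.
        return json.dumps({"error": str(exc)})


# Stand-in with the same attributes as an entry of `message.tool_calls`.
fake_call = SimpleNamespace(
    id="call_0",
    function=SimpleNamespace(name="get_weather", arguments='{"city": "Mumbai"}'),
)
result = dispatch(fake_call)
print(result)
```

With a live server, the result would normally be appended as a `{"role": "tool", "tool_call_id": ..., "content": result}` message after the assistant message, and the conversation sent back so the model can write the final answer.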

HuggingFace Transformers

from transformers import AutoProcessor, AutoModelForMultimodalLM
import torch

model = AutoModelForMultimodalLM.from_pretrained("roshangrewal/gemma4-e4b-toolcall-v02", torch_dtype=torch.bfloat16, device_map="auto")
processor = AutoProcessor.from_pretrained("roshangrewal/gemma4-e4b-toolcall-v02")

tools = [{"type": "function", "function": {
 "name": "get_weather",
 "description": "Get current weather for a city",
 "parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}
}}]

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the weather in Mumbai?"},
]
# Render the chat template with the tool schema included in the prompt.
text = processor.apply_chat_template(messages, tools=tools, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
# Decode only the newly generated tokens; special tokens are kept so the tool-call markup stays visible.
print(processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False))

LM Studio

Open LM Studio
Search roshangrewal/gemma4-e4b-toolcall-v02-gguf
Download and chat

Examples

Input	Output
"What's the weather in Mumbai?"	`call:get_weather{city:"Mumbai"}`
"Find flights from Delhi to NYC on Jan 15"	`call:search_flights{origin:"Delhi",destination:"NYC",date:"2025-01-15"}`
"What is Apple stock price?"	`call:get_stock_price{symbol:"AAPL"}`
"Transfer 5000 from ACC001 to ACC002"	`call:transfer_money{from_account:"ACC001",to_account:"ACC002",amount:5000}`
"Hi, how are you?"	"I'm doing great, thank you!" (no tool call)
"What is the capital of France?"	"Paris." (no tool call)

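When the model is used outside vLLM, the native output has to be parsed by the caller. The sketch below assumes the raw text looks exactly like the Output column above (`call:name{key:"value",key:123}`). That is an assumption drawn from the table alone: the actual decoded output may carry special tokens or different string delimiters, and the patterns would then need adjusting. `parse_call` is a hypothetical helper that handles only flat arguments (no nested objects, arrays, or escaped quotes).

```python
import re

# Patterns inferred from the examples table only; adjust them to the real decoded output.
CALL_RE = re.compile(r"call:(\w+)\{(.*)\}", re.DOTALL)
ARG_RE = re.compile(r'(\w+):(?:"([^"]*)"|([^,{}"]+))')


def _convert(token: str):
    """Turn an unquoted token into bool, int, or float when possible."""
    token = token.strip()
    if token in ("true", "false"):
        return token == "true"
    for cast in (int, float):
        try:
            return cast(token)
        except ValueError:
            continue
    return token


def parse_call(text: str):
    """Return (name, args) for the first call found, or None for a plain-text reply."""
    match = CALL_RE.search(text)
    if match is None:
        return None
    name, body = match.groups()
    args = {}
    for arg in ARG_RE.finditer(body):
        key, quoted, bare = arg.groups()
        # Quoted values stay strings; bare values are converted where possible.
        args[key] = quoted if quoted is not None else _convert(bare)
    return name, args


print(parse_call('call:get_weather{city:"Mumbai"}'))
print(parse_call('call:transfer_money{from_account:"ACC001",to_account:"ACC002",amount:5000}'))
print(parse_call("Paris."))
```

Returning `None` for plain text lets the caller treat the no-call case as an ordinary chat reply.
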
Available Formats

Format	Repo	Size	Best for
Full model (fp16)	This repo	~16 GB	vLLM, transformers
LoRA adapter	gemma4-e4b-toolcall-v02-lora	~680 MB	Further fine-tuning
GGUF Q8_0	gemma4-e4b-toolcall-v02-gguf	~8 GB	Ollama, llama.cpp, LM Studio
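
The sizes in the table can be sanity-checked with simple arithmetic. This sketch assumes roughly 8 billion stored parameters (the figure reported in the repository's Safetensors metadata) and the standard GGUF Q8_0 layout of 32 int8 weights plus one fp16 scale per block. Real files differ somewhat because some tensors are kept at other precisions and files carry metadata.

```python
PARAMS = 8e9  # approximate total parameter count (assumption, rounded)


def size_gb(params: float, bits_per_weight: float) -> float:
    """Storage in gigabytes (10^9 bytes) for a given bit width."""
    return params * bits_per_weight / 8 / 1e9


fp16_gb = size_gb(PARAMS, 16)

# Q8_0: each block holds 32 one-byte weights plus a 2-byte scale = 34 bytes per 32 weights.
q8_bits_per_weight = 34 * 8 / 32
q8_gb = size_gb(PARAMS, q8_bits_per_weight)

print(f"fp16: ~{fp16_gb:.1f} GB, Q8_0: ~{q8_gb:.1f} GB ({q8_bits_per_weight} bits/weight)")
```

Both estimates land close to the ~16 GB and ~8 GB listed above.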

Training

Parameter	Value
Base model	google/gemma-4-E4B-it
Method	QLoRA (4-bit NF4) via Unsloth
LoRA rank	64 (alpha=128)
Target modules	q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Training data	78,298 examples
Max sequence length	4,096 tokens (model supports 32K at inference)
Effective batch size	16 (per-device batch 1 x 16 gradient accumulation steps)
Learning rate	2e-4 (cosine decay)
Warmup steps	250
Steps	5,000
Training time	~85 hours
Hardware	1x NVIDIA A100 80GB GPU
Optimizer	AdamW 8-bit
NEFTune noise	alpha=5.0
Packing	Enabled (multiple examples per sequence)
Gradient checkpointing	Enabled
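
A few derived quantities follow directly from the table. The sketch below only does the arithmetic. The variable names are illustrative, the token count is an upper bound that assumes every packed sequence is full, and the scaling factor assumes standard LoRA scaling (alpha / rank) rather than a variant such as rsLoRA.

```python
per_device_batch = 1
grad_accum_steps = 16
steps = 5_000
max_seq_len = 4_096
train_hours = 85
lora_rank, lora_alpha = 64, 128

# Sequences contributing to each optimizer update.
effective_batch = per_device_batch * grad_accum_steps

# Packed sequences processed over the whole run.
sequences_seen = effective_batch * steps

# Upper bound on tokens processed, reached only if every packed sequence is full.
max_tokens_seen = sequences_seen * max_seq_len

# Average wall-clock time per optimizer step.
seconds_per_step = train_hours * 3600 / steps

# Multiplier applied to the LoRA update under standard LoRA scaling.
lora_scaling = lora_alpha / lora_rank

print(effective_batch, sequences_seen, max_tokens_seen, seconds_per_step, lora_scaling)
```

Because packing places several examples in one sequence, the 80,000 packed sequences do not map one-to-one onto the 78,298 training examples.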

Data Sources

Source	Examples	Content
NVIDIA Nemotron-SFT-Agentic-v2 (interactive_agent)	40,000	Multi-turn customer service with tool calls
NVIDIA Nemotron-SFT-Agentic-v2 (tool_calling)	817	Single/multi-step tool chains
Glaive function-calling-v2	2,378	Simple function calling pairs
Existing data (filtered & standardized)	2,143	Mixed tool-call conversations
Total (after splitting long examples)	78,298	The four sources above sum to 45,338 before splitting

All data was standardized to the OpenAI chat format (messages + tools). 80% of the examples are multi-turn conversations (avg 11.3 messages per example).
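
For illustration, a record in this style might look like the one below, together with a small consistency check. The exact field layout of the training set is not given here, so the record follows the OpenAI chat-completions message conventions as an assumption. The `check_record` helper and all values in the record are hypothetical.

```python
import json

record = {
    "messages": [
        {"role": "user", "content": "What's the weather in Mumbai?"},
        {"role": "assistant", "content": None, "tool_calls": [{
            "id": "call_0",
            "type": "function",
            "function": {"name": "get_weather", "arguments": json.dumps({"city": "Mumbai"})},
        }]},
        {"role": "tool", "tool_call_id": "call_0", "content": json.dumps({"temp_c": 31})},
        {"role": "assistant", "content": "It is 31 degrees Celsius in Mumbai."},
    ],
    "tools": [{"type": "function", "function": {
        "name": "get_weather",
        "description": "Get weather for a city",
        "parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]},
    }}],
}


def check_record(rec: dict) -> list[str]:
    """Return a list of problems; an empty list means the basic checks pass."""
    problems = []
    declared = {tool["function"]["name"] for tool in rec.get("tools", [])}
    seen_call_ids = set()
    for i, msg in enumerate(rec.get("messages", [])):
        for call in msg.get("tool_calls") or []:
            name = call["function"]["name"]
            if name not in declared:
                problems.append(f"message {i}: call to undeclared tool {name!r}")
            try:
                json.loads(call["function"]["arguments"])
            except json.JSONDecodeError:
                problems.append(f"message {i}: arguments are not valid JSON")
            seen_call_ids.add(call["id"])
        # Every tool result must answer a call that appeared earlier.
        if msg["role"] == "tool" and msg.get("tool_call_id") not in seen_call_ids:
            problems.append(f"message {i}: tool result without a matching call")
    return problems


print(check_record(record))
```

A check like this is one way to catch undeclared tools or orphaned tool results when merging sources with different original formats.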

Limitations

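Raw output uses Gemma 4's native tool-call format; vLLM with --tool-call-parser gemma4 converts it to OpenAI-style JSON tool calls
Best accuracy on sequences up to 4K tokens (the training length); longer inputs up to 32K tokens run but were not seen during training
No-call accuracy is 87.5%, so roughly one in eight prompts that need no tool still triggers a call
Not tested on non-English queries
As a 4B-class model, it is less capable than 70B+ models on complex multi-step reasoning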

Citation

@misc{grewal2026gemma4toolcall,
 title={Gemma 4 E4B Tool-Calling v0.2},
 author={Roshan Grewal},
 year={2026},
 publisher={HuggingFace},
 url={https://huggingface.co/roshangrewal/gemma4-e4b-toolcall-v02}
}

Safetensors model size: 8B params
Tensor types: F32, F16

Model tree for roshangrewal/gemma4-e4b-toolcall-v02

Base model: google/gemma-4-E4B
Finetuned: google/gemma-4-E4B-it