URL: https://huggingface.co/roshangrewal/gemma4-e4b-toolcall-v02


Gemma 4 E4B — Tool-Calling v0.2

A production-grade tool-calling model built on Google's Gemma 4 E4B-it. It was fine-tuned on 78K curated examples to call functions reliably, select the right tool, and recognize when no tool call is needed.

4B effective parameters (8B stored, per the Safetensors metadata) · 95% accuracy on BFCL Multiple (choosing the correct tool) · Runs on a single GPU · Apache 2.0

This model retains the general capabilities of Gemma 4 E4B-it (conversation, reasoning, knowledge). Tool calling was added on top of them through fine-tuning.


Benchmarks

Figure: Comparison with other models

Figure: BFCL results breakdown

BFCL Leaderboard (Official)

| Category | Accuracy |
|---|---|
| Multiple (choose correct tool) | 95.0% |
| Parallel (simultaneous calls) | 90.0% |
| Simple Python | 88.5% |
| Parallel Multiple | 86.0% |
| Live Simple | 79.8% |
| Simple JavaScript | 74.0% |
| Live Multiple | 74.2% |
| Live Parallel | 68.8% |
| Non-Live Average | 86.5% |
| Live Average | 74.8% |

BFCL Submission PR →

v0.1 → v0.2 Improvement

Figure: v0.1 vs v0.2

| Metric | v0.1 | v0.2 |
|---|---|---|
| Tool Selection | 64% | 94.4% |
| Full Match (name + args) | 28% | ~90% |
| No-Call Accuracy | 70% | 87.5% |
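
The card does not spell out how these three metrics are computed. The sketch below shows one plausible reading, under stated assumptions: tool selection and full match are measured only on examples that expect a call, and no-call accuracy only on examples that expect none. The `Call` type and `score` function are hypothetical names introduced for illustration, not part of the author's evaluation code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Call:
    """A single tool call: function name plus keyword arguments."""
    name: str
    args: dict


def score(pairs: list[tuple[Optional[Call], Optional[Call]]]) -> dict[str, float]:
    """Score (expected, predicted) pairs; None means 'no tool call'.

    Assumed definitions (not taken from the model card):
      tool_selection: among examples expecting a call, the predicted name matches
      full_match:     among the same examples, name and arguments both match
      no_call:        among examples expecting no call, none was predicted
    """
    call_cases = [(e, p) for e, p in pairs if e is not None]
    no_call_cases = [(e, p) for e, p in pairs if e is None]

    name_hits = sum(1 for e, p in call_cases if p is not None and p.name == e.name)
    full_hits = sum(1 for e, p in call_cases if p == e)  # dataclass equality
    quiet_hits = sum(1 for _, p in no_call_cases if p is None)

    def ratio(hits: int, total: int) -> float:
        return hits / total if total else float("nan")

    return {
        "tool_selection": ratio(name_hits, len(call_cases)),
        "full_match": ratio(full_hits, len(call_cases)),
        "no_call": ratio(quiet_hits, len(no_call_cases)),
    }
```

Under this reading, full match can never exceed tool selection, which is consistent with the ~90% and 94.4% figures in the table.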

Quick Start

Ollama (one command)

ollama run hf.co/roshangrewal/gemma4-e4b-toolcall-v02-gguf

vLLM (OpenAI-compatible API)

pip install vllm

vllm serve roshangrewal/gemma4-e4b-toolcall-v02 \
 --enable-auto-tool-choice \
 --tool-call-parser gemma4 \
 --max-model-len 8192 \
 --dtype float16
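
Once the server is running, query it from Python with the OpenAI client:

from openai import OpenAI

# Any placeholder key works unless the server was started with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="roshangrewal/gemma4-e4b-toolcall-v02",
    messages=[{"role": "user", "content": "What's the weather in Mumbai?"}],
    # Tools are declared in the OpenAI function-calling schema.
    tools=[{"type": "function", "function": {
        "name": "get_weather",
        "description": "Get weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }}],
)

# A list of tool calls, or None if the model answered in plain text.
print(response.choices[0].message.tool_calls)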
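
The request above only prints the tool calls. To finish the round trip, an application typically runs the requested function and sends the result back as a `tool` message. The following is a minimal sketch of that step, not part of the model card: `get_weather` is a hypothetical stub with made-up return values, `TOOL_REGISTRY` and `run_tool_calls` are illustrative names, and a `SimpleNamespace` object stands in for an entry of `response.choices[0].message.tool_calls` so the snippet runs without a server.

```python
import json
from types import SimpleNamespace


def get_weather(city: str) -> dict:
    """Hypothetical stand-in for a real weather lookup (values are made up)."""
    return {"city": city, "temp_c": 31, "condition": "humid"}


# Maps tool names (as declared in `tools`) to Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls) -> list[dict]:
    """Execute each tool call and build the `tool` messages to send back."""
    messages = []
    for call in tool_calls or []:
        fn = TOOL_REGISTRY.get(call.function.name)
        if fn is None:
            result = {"error": f"unknown tool: {call.function.name}"}
        else:
            # The OpenAI client delivers arguments as a JSON-encoded string.
            kwargs = json.loads(call.function.arguments)
            result = fn(**kwargs)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result),
        })
    return messages


# Stand-in shaped like one entry of response.choices[0].message.tool_calls.
fake_call = SimpleNamespace(
    id="call_0",
    function=SimpleNamespace(name="get_weather", arguments='{"city": "Mumbai"}'),
)
tool_messages = run_tool_calls([fake_call])
```

In a real loop, the assistant message containing the tool calls and these `tool` messages would be appended to `messages` before calling `client.chat.completions.create` again.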

HuggingFace Transformers

from transformers import AutoProcessor, AutoModelForMultimodalLM
import torch

MODEL_ID = "roshangrewal/gemma4-e4b-toolcall-v02"

# Load the weights in bfloat16 and let Accelerate place them on the available device(s).
model = AutoModelForMultimodalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Tools are declared in the OpenAI function-calling schema.
tools = [{"type": "function", "function": {
    "name": "get_weather",
    "description": "Get current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}}]

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the weather in Mumbai?"},
]

# Render the conversation and tool definitions into the model's prompt format.
text = processor.apply_chat_template(messages, tools=tools, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, return_tensors="pt").to(model.device)

# Greedy decoding keeps tool calls deterministic.
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)

# Decode only the newly generated tokens; special tokens are kept so the tool call stays visible.
print(processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False))

LM Studio

  1. Open LM Studio
  2. Search roshangrewal/gemma4-e4b-toolcall-v02-gguf
  3. Download and chat

Examples

| Input | Output |
|---|---|
| "What's the weather in Mumbai?" | call:get_weather{city:"Mumbai"} |
| "Find flights from Delhi to NYC on Jan 15" | call:search_flights{origin:"Delhi",destination:"NYC",date:"2025-01-15"} |
| "What is Apple stock price?" | call:get_stock_price{symbol:"AAPL"} |
| "Transfer 5000 from ACC001 to ACC002" | call:transfer_money{from_account:"ACC001",to_account:"ACC002",amount:5000} |
| "Hi, how are you?" | "I'm doing great, thank you!" (no tool call) |
| "What is the capital of France?" | "Paris." (no tool call) |

Available Formats

| Format | Repo | Size | Best for |
|---|---|---|---|
| Full model (fp16) | This repo | ~16 GB | vLLM, transformers |
| LoRA adapter | gemma4-e4b-toolcall-v02-lora | ~680 MB | Further fine-tuning |
| GGUF Q8_0 | gemma4-e4b-toolcall-v02-gguf | ~8 GB | Ollama, llama.cpp, LM Studio |
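
The fp16 and Q8_0 sizes can be sanity-checked with a rough back-of-the-envelope estimate. This sketch assumes about 8 billion stored parameters (the figure the repository's Safetensors metadata reports) and ignores file headers and any tensors kept at a different precision. Q8_0 stores each block of 32 weights as 32 int8 values plus one fp16 scale, which works out to 8.5 bits per weight.

```python
PARAMS = 8e9  # assumed stored parameter count, rounded


def size_gb(params: float, bits_per_weight: float) -> float:
    """Approximate on-disk size in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


fp16_gb = size_gb(PARAMS, 16)    # 2 bytes per weight
q8_0_gb = size_gb(PARAMS, 8.5)   # 34 bytes per block of 32 weights

print(f"fp16: ~{fp16_gb:.1f} GB, Q8_0: ~{q8_0_gb:.1f} GB")
```

Both estimates land close to the ~16 GB and ~8 GB listed above; the gap on the Q8_0 side is within what the rounded parameter count can explain.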

Training

| Parameter | Value |
|---|---|
| Base model | google/gemma-4-E4B-it |
| Method | QLoRA (4-bit NF4) via Unsloth |
| LoRA rank | 64 (alpha=128) |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Training data | 78,298 examples |
| Max sequence length | 4,096 tokens (model supports 32K at inference) |
| Batch size | 16 (effective: 1 × 16 gradient accumulation) |
| Learning rate | 2e-4 (cosine decay) |
| Warmup steps | 250 |
| Steps | 5,000 |
| Training time | ~85 hours |
| Hardware | 1x NVIDIA A100 80GB GPU |
| Optimizer | AdamW 8-bit |
| NEFTune noise | alpha=5.0 |
| Packing | Enabled (multiple examples per sequence) |
| Gradient checkpointing | Enabled |
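
A few derived quantities follow directly from the table. The sketch below only does arithmetic on the listed values; the token figure is an upper bound because packed sequences are not guaranteed to be full, the per-step time is approximate because the training time is, and the LoRA scale assumes the standard `alpha / rank` scaling rather than a rank-stabilized variant.

```python
steps = 5_000
per_device_batch = 1
grad_accum = 16
max_seq_len = 4_096
train_hours = 85          # approximate, as stated in the table
lora_rank, lora_alpha = 64, 128

effective_batch = per_device_batch * grad_accum   # sequences per optimizer step
sequences_seen = steps * effective_batch          # packed sequences over the run
max_tokens_seen = sequences_seen * max_seq_len    # upper bound on tokens processed
seconds_per_step = train_hours * 3600 / steps     # average wall-clock time per step
lora_scale = lora_alpha / lora_rank               # multiplier on the LoRA update

print(effective_batch, sequences_seen, max_tokens_seen, seconds_per_step, lora_scale)
```

That is 80,000 packed sequences, at most about 328M tokens, and roughly one minute per optimizer step on the single A100.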

Data Sources

| Source | Examples | Content |
|---|---|---|
| NVIDIA Nemotron-SFT-Agentic-v2 (interactive_agent) | 40,000 | Multi-turn customer service with tool calls |
| NVIDIA Nemotron-SFT-Agentic-v2 (tool_calling) | 817 | Single/multi-step tool chains |
| Glaive function-calling-v2 | 2,378 | Simple function calling pairs |
| Existing data (filtered & standardized) | 2,143 | Mixed tool-call conversations |
| Total (after splitting long examples) | 78,298 | |

All data was standardized to the OpenAI format (messages + tools). About 80% of the examples are multi-turn conversations (avg 11.3 messages per example).
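
As an illustration of the "messages + tools" layout, the sketch below builds one record in the OpenAI chat format and checks its basic shape. The record contents and the `is_valid_record` helper are hypothetical; the author's actual standardization code and exact field set are not shown in the card. The last lines add up the per-source counts from the table, which shows how many examples the splitting step contributed.

```python
def is_valid_record(record: dict) -> bool:
    """Check the minimal shape assumed here: a `messages` list plus a `tools` list."""
    messages = record.get("messages")
    tools = record.get("tools")
    if not isinstance(messages, list) or not messages:
        return False
    if not isinstance(tools, list):
        return False
    allowed_roles = {"system", "user", "assistant", "tool"}
    roles_ok = all(m.get("role") in allowed_roles for m in messages)
    tools_ok = all(
        t.get("type") == "function" and "name" in t.get("function", {})
        for t in tools
    )
    return roles_ok and tools_ok


# Hypothetical training record: user turn, tool call, tool result, final answer.
record = {
    "messages": [
        {"role": "user", "content": "What's the weather in Mumbai?"},
        {"role": "assistant", "content": None, "tool_calls": [{
            "id": "call_0", "type": "function",
            "function": {"name": "get_weather", "arguments": '{"city": "Mumbai"}'},
        }]},
        {"role": "tool", "tool_call_id": "call_0", "content": '{"temp_c": 31}'},
        {"role": "assistant", "content": "It is 31 °C in Mumbai."},
    ],
    "tools": [{"type": "function", "function": {
        "name": "get_weather",
        "description": "Get weather for a city",
        "parameters": {"type": "object",
                       "properties": {"city": {"type": "string"}},
                       "required": ["city"]},
    }}],
}

# Per-source counts from the table, before long examples were split.
source_counts = [40_000, 817, 2_378, 2_143]
before_split = sum(source_counts)
added_by_splitting = 78_298 - before_split
```

The sources sum to 45,338 examples, so splitting long conversations accounts for the remaining 32,960.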


Limitations

  • Emits tool calls in Gemma 4's native format (vLLM auto-converts them to JSON)
  • Best accuracy on sequences up to 4K tokens (the trained length); still works up to 32K
  • 87.5% accuracy on knowing when NOT to call a tool (may occasionally over-trigger)
  • Not tested on non-English queries
  • 4B model: less capable than 70B+ models on complex multi-step reasoning
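
Given these limitations, an application may want to check each proposed call against the declared tool schema before executing it. The sketch below is one possible guard, not something the model card prescribes: `validate_call` is a hypothetical helper that rejects unknown tool names, missing required arguments, and undeclared arguments. It does not detect a well-formed call that was simply unnecessary, so it only partly addresses over-triggering.

```python
def validate_call(call: dict, tools: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the call matches the schema."""
    specs = {t["function"]["name"]: t["function"] for t in tools}
    spec = specs.get(call.get("name"))
    if spec is None:
        return [f"unknown tool: {call.get('name')!r}"]

    params = spec.get("parameters", {})
    declared = params.get("properties", {})
    args = call.get("arguments", {})

    problems = [
        f"missing required argument: {key}"
        for key in params.get("required", [])
        if key not in args
    ]
    problems += [f"unexpected argument: {key}" for key in args if key not in declared]
    return problems


# Tool declaration reused from the Quick Start examples.
tools = [{"type": "function", "function": {
    "name": "get_weather",
    "description": "Get weather for a city",
    "parameters": {"type": "object",
                   "properties": {"city": {"type": "string"}},
                   "required": ["city"]},
}}]

ok = validate_call({"name": "get_weather", "arguments": {"city": "Mumbai"}}, tools)
bad = validate_call({"name": "get_weather", "arguments": {"town": "Mumbai"}}, tools)
```

Calls that fail validation can be dropped, retried, or returned to the model as an error message, depending on the application.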

Citation

@misc{grewal2026gemma4toolcall,
 title={Gemma 4 E4B Tool-Calling v0.2},
 author={Roshan Grewal},
 year={2026},
 publisher={HuggingFace},
 url={https://huggingface.co/roshangrewal/gemma4-e4b-toolcall-v02}
}

Safetensors · Model size: 8B params · Tensor types: F32, F16
