Qwen3-0.6B-ToolCalling-Claude-4.6-Opus-Distilled-v1

This model is a function-calling and reasoning-oriented fine-tune of Qwen/Qwen3-0.6B, trained with a 2-stage recipe:

- Stage 1: SFT on mixed tool-calling + reasoning data
- Stage 2: DPO to improve call-vs-no-call decisions

Model Summary

Base model: Qwen/Qwen3-0.6B
Method: LoRA SFT + LoRA DPO
Primary goal: improve tool invocation decisions and argument formatting while keeping concise reasoning behavior
Final artifacts in this repo:
- lora_adapter/* (adapter weights)
- merged_safetensors/* (merged full model)
- merged_gguf/model-f16.gguf

Training Data

Training used the following public datasets:

Salesforce/xlam-function-calling-60k
nvidia/When2Call
Roman1111111/claude-opus-4.6-10000x
Crownelius/Opus-4.6-Reasoning-3300x

Local merged split sizes:

SFT train: 91,048
SFT val: 1,858
DPO train: 8,550
DPO val: 450
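
As a quick sanity check on the split sizes above, the held-out fraction of each stage can be computed directly. The counts are the ones listed; the dictionary layout and the helper name `val_fraction` are illustrative only and not taken from the training code.

```python
# Split sizes as listed above; the dictionary layout is illustrative only.
splits = {
    "sft": {"train": 91_048, "val": 1_858},
    "dpo": {"train": 8_550, "val": 450},
}


def val_fraction(train: int, val: int) -> float:
    """Share of all examples in a stage that is held out for validation."""
    return val / (train + val)


fractions = {name: val_fraction(**sizes) for name, sizes in splits.items()}

for name, frac in fractions.items():
    print(f"{name}: {frac:.2%} held out for validation")
```

This works out to roughly 2% held out for SFT and 5% for DPO.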

Training Procedure

Stage 1: SFT

LoRA: r=64, alpha=128, dropout=0.05, target_modules=all-linear
Sequence length: 16384
Epochs: 3
Per-device batch size: 2
Gradient accumulation: 16
LR: 2e-4 (cosine, warmup ratio 0.05)
bf16 + gradient checkpointing
Best SFT eval loss: 0.126719 (checkpoint-8500)
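
The Stage 1 settings imply an effective batch size and an approximate optimizer step count. The sketch below works them out and also evaluates a linear-warmup plus cosine-decay schedule. It is not the training script: it assumes a single GPU, no sequence packing, warmup steps rounded up, and a cosine decay that ends at zero. The function name `lr_at` is hypothetical.

```python
import math

# Values taken from the Stage 1 settings above.
TRAIN_EXAMPLES = 91_048
PER_DEVICE_BATCH = 2
GRAD_ACCUM = 16
EPOCHS = 3
PEAK_LR = 2e-4
WARMUP_RATIO = 0.05
NUM_GPUS = 1  # assumption: single-GPU run without sequence packing

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_GPUS
steps_per_epoch = math.ceil(TRAIN_EXAMPLES / effective_batch)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero (assumed shape)."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"effective batch size: {effective_batch}")
print(f"approx. optimizer steps: {total_steps} ({steps_per_epoch} per epoch)")
print(f"warmup steps: {warmup_steps}")
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(f"step {step:>5}: lr = {lr_at(step):.2e}")
```

Under these assumptions the run takes about 8,500 optimizer steps. If checkpoint numbers count optimizer steps, the best checkpoint (checkpoint-8500) falls near the end of the third epoch.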

Stage 2: DPO

LoRA: r=32, alpha=64, dropout=0.05, target_modules=all-linear
beta=0.1, loss_type=sigmoid
max_length=1024, max_prompt_length=512
Per-device batch size: 1
Gradient accumulation: 32
LR: 5e-7 (cosine, warmup ratio 0.03)
bf16 + gradient checkpointing
precompute_ref_log_probs=true, precompute_ref_batch_size=1
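
For readers unfamiliar with the `loss_type=sigmoid` setting, the following is a minimal sketch of the standard per-pair sigmoid DPO objective with `beta=0.1`. It is written in plain Python for illustration and is not the TRL implementation. The log-probability values in the example are made up. Because the reference model is frozen, its log-probabilities can be computed once ahead of training, which is what `precompute_ref_log_probs` enables.

```python
import math

BETA = 0.1  # value from the Stage 2 settings above


def dpo_sigmoid_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = BETA,
) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_logratio - rejected_logratio)))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) equals log(1 + exp(-x)); this form avoids overflow.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# Made-up sequence log-probabilities, for illustration only.
at_reference = dpo_sigmoid_loss(-12.0, -13.0, -12.0, -13.0)
prefers_chosen = dpo_sigmoid_loss(-10.0, -15.0, -12.0, -13.0)

print(f"policy equal to reference: {at_reference:.4f}")
print(f"policy favors chosen response: {prefers_chosen:.4f}")
```

When the policy matches the reference, the loss is log 2 (about 0.693). It falls as the policy shifts probability toward the chosen response relative to the rejected one.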

Runtime Environment

GPU: NVIDIA GeForce RTX 5090 (31.37 GiB)
CUDA runtime (PyTorch): 12.8
PyTorch: 2.11.0+cu128
TRL: 1.0.0
PEFT: 0.18.1
Transformers: 5.5.1

Evaluation

Evaluation run date: 2026-04-12 (UTC)

Command:

python eval/eval_fc.py \
 --model_path ./outputs/final_weights/merged_safetensors \
 --eval_type all \
 --device cuda

Results:

Built-in function-calling checks (4 cases):
- Format correctness: 3/4 (75%)
- Tool selection: 3/4 (75%)

When2Call MCQ (sample=100 from test/mcq):
- Accuracy: 67/100 (67%)
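
Because these scores come from small samples, it can help to attach an uncertainty range to each. The sketch below uses the Wilson score interval at roughly 95% confidence. This is a choice made here for illustration, not part of the evaluation script.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Counts as reported in the results above.
results = {
    "When2Call MCQ accuracy": (67, 100),
    "Built-in format correctness": (3, 4),
    "Built-in tool selection": (3, 4),
}

for name, (k, n) in results.items():
    lo, hi = wilson_interval(k, n)
    print(f"{name}: {k}/{n} = {k / n:.0%}, 95% interval ~ [{lo:.0%}, {hi:.0%}]")
```

For 67/100 the interval spans roughly 57% to 75%. For 3/4 it runs from about 30% to 95%, which shows how little a 4-case check can say on its own.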

Usage

Transformers (merged model)

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "diverWayne/Qwen3-0.6B-ToolCalling-Claude-4.6-Opus-Distilled-v1"

# The merged full model is stored in the merged_safetensors/ subfolder of the repo.
tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder="merged_safetensors", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, subfolder="merged_safetensors", trust_remote_code=True)

LoRA adapter

from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model first, then attach the adapter from the lora_adapter/ subfolder.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", trust_remote_code=True)
model = PeftModel.from_pretrained(base, "diverWayne/Qwen3-0.6B-ToolCalling-Claude-4.6-Opus-Distilled-v1", subfolder="lora_adapter")
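
Parsing tool calls

Once the model has generated text, the tool call still has to be extracted from it. The sketch below assumes this fine-tune keeps the base Qwen3 convention of wrapping a JSON object in `<tool_call>...</tool_call>` tags, which this card does not confirm. The `get_weather` tool, the `parse_tool_calls` helper, and the sample output are hypothetical. A tool definition in this JSON-schema style can be passed to the tokenizer via `tokenizer.apply_chat_template(messages, tools=[GET_WEATHER], add_generation_prompt=True)`.

```python
import json
import re

# Hypothetical tool definition in JSON-schema style.
GET_WEATHER = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# Assumed output format: <tool_call>{"name": ..., "arguments": {...}}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text: str) -> list[dict]:
    """Extract well-formed tool calls; skip blocks that are not valid JSON."""
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            call = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if isinstance(call, dict) and "name" in call:
            calls.append(call)
    return calls


# Hand-written stand-in for model output, not a real generation.
sample_output = (
    '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Paris"}}\n</tool_call>'
)
print(parse_tool_calls(sample_output))
```

Returning an empty list for plain-text answers keeps the no-call case explicit, which matters for a model tuned on call-vs-no-call decisions.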

Limitations

This model can still over-call tools when the user input lacks required slots (arguments).
The built-in check includes only 4 handcrafted cases and should not be treated as a benchmark.
The When2Call score comes from a quick 100-sample evaluation, not from scoring the full test set.
Outputs may inherit biases and errors from the source datasets and synthetic data.
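
One way an application might guard against the over-calling limitation is to check a proposed call against the tool's required parameters before executing it. The sketch below is a downstream safeguard under assumed names. The `book_flight` schema and the `missing_required_args` helper are hypothetical and not part of this repo.

```python
# Hypothetical tool schema, for illustration only.
BOOK_FLIGHT = {
    "name": "book_flight",
    "parameters": {
        "type": "object",
        "properties": {
            "origin": {"type": "string"},
            "destination": {"type": "string"},
            "date": {"type": "string"},
        },
        "required": ["origin", "destination", "date"],
    },
}


def missing_required_args(call: dict, tool_schema: dict) -> list[str]:
    """Return required parameter names that the proposed call does not supply."""
    required = tool_schema.get("parameters", {}).get("required", [])
    args = call.get("arguments")
    if not isinstance(args, dict):
        args = {}
    # Treat absent, null, and empty-string values as missing.
    return [name for name in required if args.get(name) in (None, "")]


# A call the model might emit when the user never gave a travel date.
proposed = {
    "name": "book_flight",
    "arguments": {"origin": "SFO", "destination": "JFK"},
}

missing = missing_required_args(proposed, BOOK_FLIGHT)
if missing:
    print(f"Ask the user for: {', '.join(missing)}")
else:
    print("All required arguments present; safe to execute the tool.")
```

When the list is non-empty, the application can ask a follow-up question instead of executing the call.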

License

This repository is released under Apache-2.0.
Please separately verify the licenses/terms of each training dataset and upstream base model before commercial use.


URL: https://huggingface.co/diverWayne/Qwen3-0.6B-ToolCalling-Claude-4.6-Opus-Distilled-v1
