URL: https://huggingface.co/diverWayne/Qwen3-0.6B-ToolCalling-Claude-4.6-Opus-Distilled-v1

Qwen3-0.6B-ToolCalling-Claude-4.6-Opus-Distilled-v1

This model is a fine-tune of Qwen/Qwen3-0.6B oriented toward function calling and reasoning. It was trained with a two-stage recipe:

  1. SFT on mixed tool-calling + reasoning data
  2. DPO to improve call-vs-no-call decisions

Model Summary

  • Base model: Qwen/Qwen3-0.6B
  • Method: LoRA SFT + LoRA DPO
  • Primary goal: improve tool-invocation decisions and argument formatting while keeping reasoning concise
  • Final artifacts in this repo:
    • lora_adapter/* (adapter weights)
    • merged_safetensors/* (merged full model)
    • merged_gguf/model-f16.gguf

Training Data

Training used the following public datasets:

  • Salesforce/xlam-function-calling-60k
  • nvidia/When2Call
  • Roman1111111/claude-opus-4.6-10000x
  • Crownelius/Opus-4.6-Reasoning-3300x

Split sizes after merging the datasets locally:

  • SFT train: 91,048
  • SFT val: 1,858
  • DPO train: 8,550
  • DPO val: 450
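For reference, the split sizes above imply the validation fractions computed below. This is plain arithmetic on the reported counts; the dictionary layout and the `val_fraction` helper are illustrative names, not part of the training code.

```python
# Reported split sizes from the list above.
splits = {
    "sft": {"train": 91_048, "val": 1_858},
    "dpo": {"train": 8_550, "val": 450},
}


def val_fraction(train: int, val: int) -> float:
    """Share of a stage's data that is held out for validation."""
    return val / (train + val)


for stage, sizes in splits.items():
    total = sizes["train"] + sizes["val"]
    frac = val_fraction(sizes["train"], sizes["val"])
    print(f"{stage}: {total:,} total, {frac:.1%} validation")
```

The SFT stage holds out about 2% of its data and the DPO stage about 5%.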

Training Procedure

Stage 1: SFT

  • LoRA: r=64, alpha=128, dropout=0.05, target_modules=all-linear
  • Sequence length: 16384
  • Epochs: 3
  • Per-device batch size: 2
  • Gradient accumulation: 16
  • LR: 2e-4 (cosine, warmup ratio 0.05)
  • bf16 + gradient checkpointing
  • Best SFT eval loss: 0.126719 (checkpoint-8500)
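The batch settings above can be turned into an approximate step count. The sketch below assumes a single GPU (the Runtime Environment section lists one card) and no sequence packing; neither is stated for the training script itself, and the exact count depends on how the trainer rounds partial batches.

```python
import math

# Values reported for Stage 1 (SFT).
train_examples = 91_048
per_device_batch = 2
grad_accum = 16
epochs = 3
warmup_ratio = 0.05

# Assumption: one GPU and one training example per row (no packing).
num_devices = 1

effective_batch = per_device_batch * grad_accum * num_devices
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
```

Under these assumptions the run has roughly 8,500 optimizer steps, which is consistent with the best checkpoint (checkpoint-8500) falling near the end of the third epoch.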

Stage 2: DPO

  • LoRA: r=32, alpha=64, dropout=0.05, target_modules=all-linear
  • beta=0.1, loss_type=sigmoid
  • max_length=1024, max_prompt_length=512
  • Per-device batch size: 1
  • Gradient accumulation: 32
  • LR: 5e-7 (cosine, warmup ratio 0.03)
  • bf16 + gradient checkpointing
  • precompute_ref_log_probs=true, precompute_ref_batch_size=1
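To make the `beta=0.1, loss_type=sigmoid` setting concrete, here is a minimal sketch of the standard sigmoid DPO loss for a single preference pair. It is written in plain Python for illustration and is not the TRL implementation; the function name and the sequence-level log-probability inputs are assumptions. The reference log-probabilities are the quantities that `precompute_ref_log_probs=true` computes once ahead of training.

```python
import math


def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss for one (chosen, rejected) pair of log-probabilities."""
    # How much more the policy favors each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) equals softplus(-x); this form avoids overflow.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# When the policy equals the reference, the loss is log(2).
print(dpo_sigmoid_loss(-10.0, -12.0, -10.0, -12.0))
# Raising the chosen response's log-probability lowers the loss.
print(dpo_sigmoid_loss(-8.0, -12.0, -10.0, -12.0))
```

A small `beta` such as 0.1 scales the log-ratio margin down, so the policy is pulled only gently away from the SFT reference.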

Runtime Environment

  • GPU: NVIDIA GeForce RTX 5090 (31.37 GiB)
  • CUDA runtime (PyTorch): 12.8
  • PyTorch: 2.11.0+cu128
  • TRL: 1.0.0
  • PEFT: 0.18.1
  • Transformers: 5.5.1

Evaluation

Evaluation run date: 2026-04-12 (UTC)

Command:

python eval/eval_fc.py \
 --model_path ./outputs/final_weights/merged_safetensors \
 --eval_type all \
 --device cuda

Results:

  • Built-in function-calling checks (4 cases):
    • Format correctness: 3/4 (75%)
    • Tool selection: 3/4 (75%)
  • When2Call MCQ (sample=100 from test/mcq):
    • Accuracy: 67/100 (67%)
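Because both evaluations use small samples, it can help to look at the uncertainty around the reported rates. The sketch below computes a 95% Wilson score interval; the choice of this interval and the helper name are assumptions made for illustration, not part of the evaluation script.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


for label, k, n in [("When2Call MCQ", 67, 100), ("Built-in checks", 3, 4)]:
    low, high = wilson_interval(k, n)
    print(f"{label}: {k}/{n}, 95% interval {low:.3f} to {high:.3f}")
```

For the 67/100 result the interval spans roughly 57% to 75%, and for the 4-case check it is far wider, so small differences between runs should not be over-interpreted.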

Usage

Transformers (merged model)

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "diverWayne/Qwen3-0.6B-ToolCalling-Claude-4.6-Opus-Distilled-v1"

# The merged full model is stored in the merged_safetensors/ subfolder of the repo.
tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder="merged_safetensors", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, subfolder="merged_safetensors", trust_remote_code=True)
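Once the model has generated a response, tool calls have to be extracted from the decoded text. The sketch below assumes the output follows the Qwen3 chat-template convention of wrapping each call in `<tool_call>` tags around a JSON object with `name` and `arguments` keys; this card does not state the output format, so verify it against real generations. The helper name and the `get_weather` sample are hypothetical.

```python
import json
import re

# Assumed format: <tool_call>{"name": ..., "arguments": {...}}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text: str) -> list[dict]:
    """Return every well-formed tool call found in a decoded model response."""
    calls = []
    for payload in TOOL_CALL_RE.findall(text):
        try:
            call = json.loads(payload)
        except json.JSONDecodeError:
            continue  # skip blocks whose JSON is malformed
        if isinstance(call, dict) and "name" in call:
            calls.append({"name": call["name"],
                          "arguments": call.get("arguments", {})})
    return calls


# Hypothetical decoded output, used only to exercise the parser.
sample = '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Paris"}}\n</tool_call>'
print(extract_tool_calls(sample))
```

An empty list means the model answered without calling a tool, or that its call was malformed; logging the raw text in that case makes format errors easier to spot.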

LoRA adapter

from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model first, then attach the adapter weights from lora_adapter/.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", trust_remote_code=True)
model = PeftModel.from_pretrained(base, "diverWayne/Qwen3-0.6B-ToolCalling-Claude-4.6-Opus-Distilled-v1", subfolder="lora_adapter")

Limitations

  • The model can still over-call tools, issuing a call even when the user input is missing required arguments (slots).
  • The built-in check covers only 4 handcrafted cases and should not be treated as a benchmark.
  • The When2Call score comes from a quick 100-sample evaluation, not from scoring the full test set.
  • Outputs may inherit biases and errors from the source datasets and from synthetic data.
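One way to guard against the over-calling noted in the first limitation is to check a proposed call against the tool's schema before executing it. The sketch below assumes tools are described with a JSON-Schema-style `parameters` block that has a `required` list; the `get_weather` tool and the helper name are hypothetical.

```python
def missing_required_args(call: dict, tool_schema: dict) -> list[str]:
    """List required parameters that the proposed call leaves unset or empty."""
    required = tool_schema.get("parameters", {}).get("required", [])
    args = call.get("arguments")
    if not isinstance(args, dict):
        args = {}  # treat absent or non-dict arguments as empty
    return [name for name in required if args.get(name) in (None, "")]


# Hypothetical tool definition used only for this example.
weather_tool = {
    "name": "get_weather",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}, "unit": {"type": "string"}},
        "required": ["city", "unit"],
    },
}

proposed = {"name": "get_weather", "arguments": {"city": "Paris"}}
print(missing_required_args(proposed, weather_tool))
```

If the returned list is non-empty, the application can ask the user for the missing values instead of running the call.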

License

This repository is released under Apache-2.0.
Please separately verify the licenses/terms of each training dataset and upstream base model before commercial use.
