URL: https://huggingface.co/RuishanFang/Qwen3-4B-RODS


RODS: Reward-Driven Online Data Synthesis for Multi-Turn Tool-Use Agents

๐Ÿ‘ arXiv
๐Ÿ‘ Paper
๐Ÿ‘ Model
๐Ÿ‘ GitHub
๐Ÿ‘ Project Page

Model Overview

Qwen3-4B-RODS is a Large Language Model (LLM) fine-tuned for complex, multi-turn function calling (FC) and agentic tool-use tasks. Built on the Qwen3-4B-Instruct base model, it was trained with the RODS (Reward-Driven Online Data Synthesis) framework combined with GRPO reinforcement learning.

RODS closes the loop between RL training and data generation. It repurposes the variance of the progress reward as a zero-cost capability boundary detector, continuously synthesizes structurally isomorphic training data at the agent's learning frontier, and manages a dynamic replay buffer that co-evolves with the policy. Starting from only 400 human-annotated seeds, RODS raises the BFCLv3 multi-turn overall score of the 4B base model from 22.13 to 56.00.

  • Base Model: Qwen3-4B-Instruct
  • Size: 4 Billion parameters
  • Key Capability: Advanced Multi-Turn Function Calling and Agentic Tool-Use

Evaluation Results

The model was evaluated on the Berkeley Function-Calling Leaderboard (BFCL).

BFCLv3 Multi-Turn Performance

| Model | Size | Multi-Turn (Overall) | Base | Miss Func | Miss Param | Long Context |
|---|---|---|---|---|---|---|
| Qwen3-4B-Instruct (Base) | 4B | 22.13 | 26.50 | 21.00 | 15.50 | 25.50 |
| Qwen3-4B + RODS (ours) | 4B | 56.00 | 68.00 | 59.00 | 44.00 | 53.00 |
| Claude-Sonnet-4-5-20250929 | - | 61.38 | 69.00 | 65.00 | 52.50 | 59.00 |
| Grok-4-1-fast-reasoning | - | 58.88 | 70.50 | 59.50 | 43.00 | 62.50 |
| Kimi-K2-Instruct | 1043B | 50.63 | 62.00 | 41.00 | 44.50 | 55.00 |
| Qwen3-32B | 32B | 47.88 | 56.00 | 52.50 | 40.00 | 43.00 |
| DeepSeek-V3.2-Exp | 671B | 44.88 | 55.00 | 49.00 | 27.00 | 48.50 |
| GPT-4o-2024-11-20 | - | 42.50 | 55.50 | 34.50 | 29.00 | 51.00 |

Training Data and Framework

RODS Framework

RODS is a closed-loop RL-data synthesis framework with three co-evolving modules:

  1. Reward-Based Boundary Detection: Uses GRPO rollout reward variance as a zero-cost probe to identify tasks at the agent's capability boundary, where gradient signal is richest.
  2. Skill-Aligned Synthesis Pipeline: A multi-agent pipeline (Planner → Executor → Rewriter → Critic) generates structurally isomorphic variants that preserve API topology and dependency depth while introducing novel narratives and environment states.
  3. Dynamic Replay Buffer Management: A dual-control lifecycle with staged injection and multi-layer retirement keeps the training pool anchored at the shifting capability boundary.
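
The link between reward variance and gradient signal in module 1 can be illustrated with a minimal sketch. It assumes the standard GRPO formulation, in which each rollout's reward is standardized within its group, and progress rewards in [0, 1]. The `min_variance` threshold, the function names, and the task ids are hypothetical and not taken from the RODS paper or code.

```python
from statistics import mean, pstdev, pvariance


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize each reward within its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def boundary_tasks(rollout_rewards, min_variance=0.05):
    """Return ids of tasks whose rollout rewards vary enough to carry signal.

    rollout_rewards: dict mapping task id -> list of per-rollout progress rewards.
    min_variance: hypothetical threshold chosen for illustration only.
    """
    return [
        task
        for task, rewards in rollout_rewards.items()
        if pvariance(rewards) >= min_variance
    ]


# Three illustrative tasks with K=16 rollouts each.
rollouts = {
    "always_solved": [1.0] * 16,             # too easy: every rollout succeeds
    "never_solved": [0.0] * 16,              # too hard: every rollout fails
    "at_boundary": [1.0] * 8 + [0.0] * 8,    # mixed outcomes
}

selected = boundary_tasks(rollouts)
easy_advantages = group_relative_advantages(rollouts["always_solved"])
```

In this sketch the too-easy and too-hard tasks produce all-zero advantages, so they contribute nothing to the policy update, while the mixed-outcome task is the only one selected. The variance is computed from rewards that the rollouts already produce, which is why the probe adds no extra cost.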

Training Details

  • Method: GRPO (Group Relative Policy Optimization)
  • Rollouts: K=16 per prompt
  • Training stages:
    1. Format training (100 Base samples, format reward)
    2. Base reasoning (100 Base samples, progress reward)
    3. Full expansion (400 samples + dynamic synthesis, progress reward)
  • Synthesis backbone: Qwen3-32B via vLLM
  • Hardware: 8x A100 (training) + 8x A100 (synthesis)
  • Active training pool: ~800 samples (400 seeds + up to 400 generated)
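
The pool sizing in the last item (400 fixed seeds plus at most 400 generated samples) can be sketched as a capped container. This is only an illustration under stated assumptions: the `TrainingPool` class, its method names, and the retirement rule (drop the generated task with the lowest recent reward variance) are hypothetical, and the actual RODS staged-injection and multi-layer retirement logic is not described in this card.

```python
from dataclasses import dataclass, field


@dataclass
class TrainingPool:
    seeds: list                    # human-annotated seed task ids, never retired
    max_generated: int = 400       # cap on synthesized samples, from the card
    generated: dict = field(default_factory=dict)  # task id -> latest reward variance

    def inject(self, task_id, variance):
        """Add a synthesized task; if over the cap, retire the lowest-variance one."""
        self.generated[task_id] = variance
        if len(self.generated) > self.max_generated:
            # Hypothetical retirement rule: low variance means little signal left.
            stale = min(self.generated, key=self.generated.get)
            del self.generated[stale]

    def active(self):
        """Task ids currently available for sampling."""
        return list(self.seeds) + list(self.generated)


pool = TrainingPool(seeds=[f"seed-{i}" for i in range(400)])
for i in range(450):
    # Synthetic variances for illustration only.
    pool.inject(f"synth-{i}", variance=(i % 10) / 40)
```

After the 450 injections the generated portion stays at the cap, so the active pool holds the 400 seeds plus 400 synthesized tasks.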

Usage

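from transformers import AutoModelForCausalLM, AutoTokenizer

# Hugging Face Hub repository id of the fine-tuned checkpoint
model_name = "RuishanFang/Qwen3-4B-RODS"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# device_map="auto" places the weights on the available GPU(s), or on CPU if none
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")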

For tool-use inference, follow the Qwen3 function calling format. The model expects tools to be provided in the system prompt and generates structured <tool_call> responses.
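
A small parser can turn the generated text into structured calls. The sketch below assumes the Hermes-style layout used by Qwen models, where each `<tool_call>` block wraps one JSON object with `name` and `arguments` keys. The `parse_tool_calls` helper and the `get_weather` tool are hypothetical names used only for illustration.

```python
import json
import re

# Matches the payload between <tool_call> and </tool_call>, across newlines.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text):
    """Extract tool calls from generated text.

    Assumes each <tool_call> block holds a JSON object with "name" and
    "arguments" keys; blocks that do not parse are skipped.
    """
    calls = []
    for payload in TOOL_CALL_RE.findall(text):
        try:
            call = json.loads(payload)
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than guessing their content
        if isinstance(call, dict) and "name" in call:
            calls.append(
                {"name": call["name"], "arguments": call.get("arguments", {})}
            )
    return calls


# Illustrative model output; get_weather is a made-up tool.
sample = (
    "I will check the weather first.\n"
    "<tool_call>\n"
    '{"name": "get_weather", "arguments": {"city": "Hangzhou"}}\n'
    "</tool_call>"
)
calls = parse_tool_calls(sample)
```

In a multi-turn loop, each parsed call would be executed by the host application and its result appended to the conversation as a tool message before the next generation step. On the input side, `tokenizer.apply_chat_template` accepts a `tools` argument that renders the tool schemas into the prompt.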


Related Projects and Citation

This work is part of AWorld, an open-source project from InclusionAI.

If you use RODS in your research, please cite:

@article{fang2026rods,
 title={RODS: Reward-Driven Online Data Synthesis for Multi-Turn Tool-Use Agents},
 author={Fang, Ruishan and Lu, Siyuan and Zhuang, Chenyi and Lin, Tao},
 journal={arXiv preprint arXiv:2606.19047},
 year={2026}
}

Contact

For inquiries, please contact:

  • fangruishan@westlake.edu.cn