RODS: Reward-Driven Online Data Synthesis for Multi-Turn Tool-Use Agents
arXiv | Paper | Model | GitHub | Project Page
Model Overview
Qwen3-4B-RODS is a Large Language Model (LLM) fine-tuned for complex, multi-turn Function Calling (FC) and agentic tool-use tasks. Built on Qwen3-4B-Instruct, it was trained with the RODS (Reward-Driven Online Data Synthesis) framework combined with GRPO reinforcement learning.
RODS closes the loop between RL training and data generation: it repurposes the variance of the progress reward as a zero-cost capability boundary detector, continuously synthesizes structurally isomorphic training data at the agent's learning frontier, and manages a dynamic replay buffer that co-evolves with the policy. Starting from only 400 human-annotated seeds, RODS raises the BFCLv3 multi-turn overall score of Qwen3-4B-Instruct from 22.13 to 56.00.
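The boundary-detection idea can be illustrated with a minimal sketch. It assumes progress rewards in [0, 1] and a hypothetical threshold `min_variance`; neither the threshold value nor the function names come from the RODS code. Because GRPO already samples several rollouts per prompt, the variance is computed from rewards that exist anyway, which is why the probe costs nothing extra.

```python
from statistics import pvariance


def reward_variance(rewards):
    """Population variance of the progress rewards from one prompt's rollouts."""
    return pvariance(rewards)


def at_capability_boundary(rewards, min_variance=0.05):
    """Hypothetical rule: a task sits at the boundary when its rollouts disagree.

    Near-zero variance means the policy either always solves the task or
    always fails it; in both cases the rollouts carry little learning signal.
    """
    return reward_variance(rewards) >= min_variance


# Three illustrative groups of K=16 rollout rewards (made-up values).
mastered = [1.0] * 16             # always solved
too_hard = [0.0] * 16             # never solved
frontier = [1.0] * 8 + [0.0] * 8  # solved about half the time

for name, rewards in [("mastered", mastered), ("too_hard", too_hard), ("frontier", frontier)]:
    print(name, reward_variance(rewards), at_capability_boundary(rewards))
```

Only the `frontier` group passes the check, so under this toy rule it would be the one selected as a template for further synthesis.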
- Base Model: Qwen3-4B-Instruct
- Size: 4 Billion parameters
- Key Capability: Advanced Multi-Turn Function Calling and Agentic Tool-Use
Evaluation Results
The model was evaluated on the multi-turn category of the Berkeley Function-Calling Leaderboard (BFCL) v3. All scores below are accuracy (%).
BFCLv3 Multi-Turn Performance
| Model | Size | Multi-Turn (Overall) | Base | Miss Func | Miss Param | Long Context |
|---|---|---|---|---|---|---|
| Qwen3-4B-Instruct (Base) | 4B | 22.13 | 26.50 | 21.00 | 15.50 | 25.50 |
| Qwen3-4B + RODS (ours) | 4B | 56.00 | 68.00 | 59.00 | 44.00 | 53.00 |
| Claude-Sonnet-4-5-20250929 | - | 61.38 | 69.00 | 65.00 | 52.50 | 59.00 |
| Grok-4-1-fast-reasoning | - | 58.88 | 70.50 | 59.50 | 43.00 | 62.50 |
| Kimi-K2-Instruct | 1043B | 50.63 | 62.00 | 41.00 | 44.50 | 55.00 |
| Qwen3-32B | 32B | 47.88 | 56.00 | 52.50 | 40.00 | 43.00 |
| DeepSeek-V3.2-Exp | 671B | 44.88 | 55.00 | 49.00 | 27.00 | 48.50 |
| GPT-4o-2024-11-20 | - | 42.50 | 55.50 | 34.50 | 29.00 | 51.00 |
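As a reading aid, the Overall column in the table is consistent with an unweighted mean of the four subcategory columns. This is an observation from the numbers above, not a statement of how BFCL officially defines the score; the helper name below is hypothetical.

```python
def overall_multi_turn(base, miss_func, miss_param, long_context):
    """Unweighted mean of the four multi-turn subcategory accuracies (%)."""
    return (base + miss_func + miss_param + long_context) / 4


# (reported overall, base, miss func, miss param, long context), copied from the table.
rows = {
    "Qwen3-4B-Instruct (Base)": (22.13, 26.50, 21.00, 15.50, 25.50),
    "Qwen3-4B + RODS (ours)": (56.00, 68.00, 59.00, 44.00, 53.00),
    "Claude-Sonnet-4-5-20250929": (61.38, 69.00, 65.00, 52.50, 59.00),
}

for name, (reported, *subscores) in rows.items():
    computed = overall_multi_turn(*subscores)
    # The table rounds to two decimals, so allow a small tolerance.
    matches = abs(computed - reported) < 0.01
    print(f"{name}: reported {reported:.2f}, computed {computed:.3f}, consistent={matches}")
```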
Training Data and Framework
RODS Framework
RODS is a closed-loop RL-data synthesis framework with three co-evolving modules:
- Reward-Based Boundary Detection: Uses GRPO rollout reward variance as a zero-cost probe to identify tasks at the agent's capability boundary, where gradient signal is richest.
- Skill-Aligned Synthesis Pipeline: A multi-agent pipeline (Planner → Executor → Rewriter → Critic) generates structurally isomorphic variants that preserve API topology and dependency depth while introducing novel narratives and environment states.
- Dynamic Replay Buffer Management: A dual-control lifecycle with staged injection and multi-layer retirement keeps the training pool anchored at the shifting capability boundary.
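To make the replay-buffer idea more concrete, here is a toy sketch of a capped pool with staged injection and a single retirement rule. It is an illustration under assumptions, not the RODS implementation: the class and field names, the injection batch size, the patience counter, and the variance threshold are all hypothetical, and the actual multi-layer retirement policy is not reproduced here. Only the cap of 400 generated samples is taken from this card.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    task_id: str
    is_seed: bool
    low_variance_streak: int = 0  # consecutive checks with near-zero reward variance


@dataclass
class ReplayPool:
    max_generated: int = 400    # from the card: up to 400 generated samples
    inject_per_step: int = 32   # hypothetical staged-injection batch size
    retire_after: int = 3       # hypothetical patience before retirement
    min_variance: float = 0.05  # hypothetical "still informative" threshold
    samples: dict = field(default_factory=dict)
    pending: list = field(default_factory=list)

    def add_seed(self, task_id):
        self.samples[task_id] = Sample(task_id, is_seed=True)

    def queue(self, task_id):
        """Park a freshly synthesized task until the next injection stage."""
        self.pending.append(task_id)

    def generated_count(self):
        return sum(1 for s in self.samples.values() if not s.is_seed)

    def inject(self):
        """Move a bounded batch of queued tasks into the active pool."""
        room = self.max_generated - self.generated_count()
        n = max(0, min(self.inject_per_step, room, len(self.pending)))
        batch, self.pending = self.pending[:n], self.pending[n:]
        for task_id in batch:
            self.samples[task_id] = Sample(task_id, is_seed=False)
        return batch

    def update(self, task_id, reward_variance):
        """Record a variance reading; retire a generated task that stays uninformative."""
        sample = self.samples[task_id]
        if reward_variance >= self.min_variance:
            sample.low_variance_streak = 0
            return False
        sample.low_variance_streak += 1
        if not sample.is_seed and sample.low_variance_streak >= self.retire_after:
            del self.samples[task_id]
            return True
        return False
```

In this sketch seeds are never retired, so the pool always keeps its human-annotated anchor while generated tasks rotate in and out as the policy improves.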
Training Details
- Method: GRPO (Group Relative Policy Optimization)
- Rollouts: K=16 per prompt
- Training stages:
  - Format training (100 Base samples, format reward)
  - Base reasoning (100 Base samples, progress reward)
  - Full expansion (400 samples + dynamic synthesis, progress reward)
- Synthesis backbone: Qwen3-32B via vLLM
- Hardware: 8x A100 (training) + 8x A100 (synthesis)
- Active training pool: ~800 samples (400 seeds + up to 400 generated)
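For readers unfamiliar with GRPO, the sketch below shows the commonly used group-relative advantage: each rollout's reward is centered on the group mean and scaled by the group standard deviation. This is the standard formulation, and the RODS training code may differ in details (some GRPO variants drop the standard-deviation scaling, for example). The function name and `eps` value are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Advantage of each rollout relative to its own group of K rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# K=16 rollouts for one prompt, with made-up progress rewards.
uniform_group = [1.0] * 16
mixed_group = [1.0] * 8 + [0.0] * 8

print(group_relative_advantages(uniform_group)[:4])
print(group_relative_advantages(mixed_group)[:4])
```

When every rollout receives the same reward, all advantages are zero and the prompt contributes no policy gradient. That is the link to boundary detection: prompts with non-trivial reward variance are the ones that still move the policy.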
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "RuishanFang/Qwen3-4B-RODS"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
For tool-use inference, follow the Qwen3 function calling format. The model expects tools to be provided in the system prompt and generates structured <tool_call> responses.
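On the output side, a small parser is usually enough to turn the generated text into executable calls. The sketch below assumes the Hermes-style layout used by Qwen chat templates, where each call is a JSON object with `name` and `arguments` wrapped in `<tool_call>` tags; the `get_weather` tool and the sample output are hypothetical and only illustrate the shape.

```python
import json
import re

# Non-greedy match so that several <tool_call> blocks in one reply are split correctly.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text):
    """Extract well-formed tool calls from generated text, skipping malformed ones."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # ignore blocks that are not valid JSON
        if isinstance(payload, dict) and "name" in payload:
            calls.append({"name": payload["name"], "arguments": payload.get("arguments", {})})
    return calls


# Hypothetical model output for illustration.
sample_output = (
    "I will check the weather first.\n"
    "<tool_call>\n"
    '{"name": "get_weather", "arguments": {"city": "Hangzhou"}}\n'
    "</tool_call>"
)

print(parse_tool_calls(sample_output))
```

On the input side, tool schemas are typically rendered into the prompt by passing them through the `tools` argument of `tokenizer.apply_chat_template`, which lets the model's own chat template decide the exact layout.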
Related Projects and Citation
This work is part of AWorld, an open-source project from InclusionAI.
If you use RODS in your research, please cite:
@article{fang2026rods,
  title={RODS: Reward-Driven Online Data Synthesis for Multi-Turn Tool-Use Agents},
  author={Fang, Ruishan and Lu, Siyuan and Zhuang, Chenyi and Lin, Tao},
  journal={arXiv preprint arXiv:2606.19047},
  year={2026}
}
Contact
For inquiries, please contact:
fangruishan@westlake.edu.cn
