RODS: Reward-Driven Online Data Synthesis for Multi-Turn Tool-Use Agents

Model Overview

Qwen3-4B-RODS is a Large Language Model (LLM) fine-tuned for complex, multi-turn Function Calling (FC) and agentic tool-use tasks. It is built on Qwen3-4B-Instruct and trained with the RODS (Reward-Driven Online Data Synthesis) framework combined with GRPO reinforcement learning.

RODS closes the loop between RL training and data generation: it repurposes the variance of the progress reward as a zero-cost detector of the agent's capability boundary, continuously synthesizes structurally isomorphic training data at the agent's learning frontier, and manages a dynamic replay buffer that co-evolves with the policy. Starting from only 400 human-annotated seeds, RODS raises BFCLv3 multi-turn overall accuracy from 22.13 (base model) to 56.00.

Base Model: Qwen3-4B-Instruct
Size: 4 Billion parameters
Key Capability: Advanced Multi-Turn Function Calling and Agentic Tool-Use

Evaluation Results

The model was evaluated on the Berkeley Function-Calling Leaderboard (BFCL).

BFCLv3 Multi-Turn Performance (accuracy, %)

Model	Size	Multi-Turn (Overall)	Base	Miss Func	Miss Param	Long Context
Qwen3-4B-Instruct (Base)	4B	22.13	26.50	21.00	15.50	25.50
Qwen3-4B + RODS (ours)	4B	56.00	68.00	59.00	44.00	53.00
Claude-Sonnet-4-5-20250929	-	61.38	69.00	65.00	52.50	59.00
Grok-4-1-fast-reasoning	-	58.88	70.50	59.50	43.00	62.50
Kimi-K2-Instruct	1043B	50.63	62.00	41.00	44.50	55.00
Qwen3-32B	32B	47.88	56.00	52.50	40.00	43.00
DeepSeek-V3.2-Exp	671B	44.88	55.00	49.00	27.00	48.50
GPT-4o-2024-11-20	-	42.50	55.50	34.50	29.00	51.00
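
As a quick consistency check on the table above, the Multi-Turn (Overall) column matches the unweighted mean of the four subcategory scores, rounded half-up to two decimals. This is an observation from the reported numbers, not a statement of how BFCL officially aggregates its scores; the helper name `overall_from_subscores` is illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP


def overall_from_subscores(subscores):
    """Unweighted mean of subcategory scores, rounded half-up to 2 decimals."""
    total = sum(Decimal(s) for s in subscores)
    mean = total / Decimal(len(subscores))
    return mean.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# name -> (reported overall, [Base, Miss Func, Miss Param, Long Context]),
# copied from the table above.
ROWS = {
    "Qwen3-4B-Instruct (Base)": ("22.13", ["26.50", "21.00", "15.50", "25.50"]),
    "Qwen3-4B + RODS (ours)": ("56.00", ["68.00", "59.00", "44.00", "53.00"]),
    "Claude-Sonnet-4-5-20250929": ("61.38", ["69.00", "65.00", "52.50", "59.00"]),
    "Grok-4-1-fast-reasoning": ("58.88", ["70.50", "59.50", "43.00", "62.50"]),
    "Kimi-K2-Instruct": ("50.63", ["62.00", "41.00", "44.50", "55.00"]),
    "Qwen3-32B": ("47.88", ["56.00", "52.50", "40.00", "43.00"]),
    "DeepSeek-V3.2-Exp": ("44.88", ["55.00", "49.00", "27.00", "48.50"]),
    "GPT-4o-2024-11-20": ("42.50", ["55.50", "34.50", "29.00", "51.00"]),
}

for name, (overall, subs) in ROWS.items():
    # Every reported overall score agrees with the recomputed mean.
    assert overall_from_subscores(subs) == Decimal(overall), name
```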

Training Data and Framework

RODS Framework

RODS is a closed-loop RL-data synthesis framework with three co-evolving modules:

Reward-Based Boundary Detection: Uses GRPO rollout reward variance as a zero-cost probe to identify tasks at the agent's capability boundary, where gradient signal is richest.
Skill-Aligned Synthesis Pipeline: A multi-agent pipeline (Planner → Executor → Rewriter → Critic) generates structurally isomorphic variants that preserve API topology and dependency depth while introducing novel narratives and environment states.
Dynamic Replay Buffer Management: A dual-control lifecycle with staged injection and multi-layer retirement keeps the training pool anchored at the shifting capability boundary.
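
The first module can be illustrated with a minimal sketch. It assumes each task has K scalar rollout rewards in [0, 1] and flags a task as being at the capability boundary when the population variance of those rewards reaches a threshold. The function name, the threshold value, and the example rewards are hypothetical; the selection rule actually used by RODS is not specified here.

```python
from statistics import pvariance


def boundary_tasks(rollout_rewards, min_variance=0.05):
    """Return ids of tasks whose rollout rewards disagree enough to be informative.

    rollout_rewards maps a task id to the list of rewards from its K rollouts.
    min_variance is a hypothetical threshold chosen for illustration only.
    """
    selected = []
    for task_id, rewards in rollout_rewards.items():
        if len(rewards) < 2:
            continue  # variance is not meaningful for a single rollout
        if pvariance(rewards) >= min_variance:
            selected.append(task_id)
    return selected


# Illustrative rewards for K=16 rollouts per task.
example = {
    "always_solved": [1.0] * 16,            # too easy: variance 0
    "never_solved": [0.0] * 16,             # too hard: variance 0
    "mixed": [1.0] * 8 + [0.0] * 8,         # at the boundary: variance 0.25
}
print(boundary_tasks(example))  # ['mixed']
```

The probe adds no extra inference cost in this sketch because it only reuses rewards that the GRPO rollouts already produce.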

Training Details

Method: GRPO (Group Relative Policy Optimization)
Rollouts: K=16 per prompt
Training stages:
1. Format training (100 Base samples, format reward)
2. Base reasoning (100 Base samples, progress reward)
3. Full expansion (400 samples + dynamic synthesis, progress reward)
Synthesis backbone: Qwen3-32B via vLLM
Hardware: 8x A100 (training) + 8x A100 (synthesis)
Active training pool: ~800 samples (400 seeds + up to 400 generated)
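
GRPO scores each rollout relative to the other rollouts for the same prompt. A common formulation standardizes rewards within the group, A_i = (r_i - mean(r)) / (std(r) + eps); the sketch below assumes that formulation and is not taken from the RODS training code. It also shows why a prompt whose K rollouts all receive the same reward contributes no learning signal, which is what makes reward variance usable as a boundary probe.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's rollout group.

    eps is an assumed small constant that avoids division by zero
    when every rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# K=16 rollouts, 4 successes: successful rollouts get positive advantages.
mixed = group_relative_advantages([1.0] * 4 + [0.0] * 12)

# K=16 rollouts, all successes: every advantage is exactly zero.
uniform = group_relative_advantages([1.0] * 16)
```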

Usage

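from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and weights from the Hugging Face Hub.
model_name = "RuishanFang/Qwen3-4B-RODS"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# device_map="auto" requires the `accelerate` package to be installed.
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")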

For tool-use inference, follow the Qwen3 function calling format. The model expects tools to be provided in the system prompt and generates structured <tool_call> responses.
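
A minimal sketch of extracting tool calls from generated text, assuming the Qwen-style convention in which each call is a JSON object with `name` and `arguments` keys wrapped in `<tool_call>` and `</tool_call>` tags. The helper name `parse_tool_calls`, the `get_weather` tool, and the sample output string are illustrative and not part of the model's API.

```python
import json
import re

# Capture the payload between each pair of tool-call tags, across newlines.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text):
    """Return a list of {"name", "arguments"} dicts found in model output."""
    calls = []
    for payload in TOOL_CALL_RE.findall(text):
        try:
            call = json.loads(payload)
        except json.JSONDecodeError:
            continue  # skip malformed payloads instead of failing the turn
        if isinstance(call, dict) and "name" in call:
            calls.append(
                {"name": call["name"], "arguments": call.get("arguments", {})}
            )
    return calls


# Hypothetical model output containing one tool call.
sample = (
    "I will check the weather.\n"
    "<tool_call>\n"
    '{"name": "get_weather", "arguments": {"city": "Hangzhou"}}\n'
    "</tool_call>"
)
print(parse_tool_calls(sample))
```

To build the prompt side, recent versions of transformers accept a `tools` argument in `tokenizer.apply_chat_template`, which lets the chat template render the tool schemas; whether that is the intended entry point for this model should be checked against the Qwen3 function calling documentation.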

Related Projects and Citation

This work is part of AWorld, an open-source project from InclusionAI.

If you use RODS in your research, please cite:

@article{fang2026rods,
 title={RODS: Reward-Driven Online Data Synthesis for Multi-Turn Tool-Use Agents},
 author={Fang, Ruishan and Lu, Siyuan and Zhuang, Chenyi and Lin, Tao},
 journal={arXiv preprint arXiv:2606.19047},
 year={2026}
}

Contact

For inquiries, please contact:

fangruishan@westlake.edu.cn

URL: https://huggingface.co/RuishanFang/Qwen3-4B-RODS