Last indexed: 7 May 2026 (2e12c1)

Customer Service Agents (Tau2)

The Tau2-Bench integration in AReaL provides a specialized pipeline for training customer service agents in realistic, multi-turn simulation environments. In these environments (retail, airline, telecom), agents must resolve complex user requests by invoking tools and providing guidance (examples/tau2/README.md5-9).

System Overview

Training Tau2 agents requires coordination between the AReaL RL Trainer, a Proxy Rollout Server, and the Tau2 Simulation Environment. The environment uses an external User Simulator (typically a large LLM such as Qwen2.5-72B) to interact with the agent being trained (examples/tau2/README.md60-74).

Data Flow and Interaction

The interaction cycle follows an agentic RL pattern where the agent's actions (text or tool calls) are captured as trajectories for optimization.

Workflow Initiation: The Tau2AgentWorkflow manages the simulation lifecycle by running tau2 simulations (examples/tau2/README.md13-15).
Inference Routing: Agent completions are routed through AReaL's self-hosted inference servers (SGLang or vLLM) via a proxy server that tracks log-probabilities and token usage for RL training (examples/tau2/README.md15-18).
Environment Feedback: The Tau2 environment processes agent tool calls and updates the simulation state, communicating with the user simulator via the configured user_llm_base_url (examples/tau2/README.md118-120).
Reward Calculation: At the end of a trajectory, a reward is assigned based on task success and efficiency, adjusted by an invalid_format_penalty (examples/tau2/README.md122-123).
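
The four steps above can be illustrated with a toy rollout loop. This is only a sketch under stated assumptions: `Trajectory`, `run_episode`, `toy_agent`, and `toy_env` are hypothetical names, not AReaL or tau2 APIs, and the reward rule (1.0 on success, minus the penalty when an action is empty) is an assumed stand-in for the real success and efficiency scoring.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Trajectory:
    """Hypothetical container; not AReaL's actual trajectory type."""
    turns: List[Tuple[str, str]] = field(default_factory=list)  # (role, content)
    reward: float = 0.0


def run_episode(
    agent: Callable[[str], str],
    env_step: Callable[[str], Tuple[str, bool, bool]],
    first_message: str,
    max_steps: int = 8,
    invalid_format_penalty: float = 0.1,
) -> Trajectory:
    """Alternate agent actions and environment feedback, then score the episode."""
    traj = Trajectory()
    observation = first_message
    malformed = False
    success = False
    for _ in range(max_steps):
        action = agent(observation)  # text reply or tool call
        traj.turns.append(("agent", action))
        if not action.strip():
            # Assumption: an empty action counts as an invalid format.
            malformed = True
            break
        observation, done, success = env_step(action)
        traj.turns.append(("env", observation))
        if done:
            break
    # Assumed reward rule: binary success minus an optional format penalty.
    penalty = invalid_format_penalty if malformed else 0.0
    traj.reward = (1.0 if success else 0.0) - penalty
    return traj


def toy_agent(observation: str) -> str:
    """Call a tool first, then confirm the result to the user."""
    if observation.startswith("TOOL_RESULT"):
        return "Your booking is cancelled."
    return "cancel_booking(id=42)"


def toy_env(action: str) -> Tuple[str, bool, bool]:
    """Return (observation, done, success) for a single agent action."""
    if action.startswith("cancel_booking"):
        return "TOOL_RESULT: cancelled", False, False
    return "Thanks, bye.", True, True


trajectory = run_episode(toy_agent, toy_env, "I want to cancel my booking.")
print(len(trajectory.turns), trajectory.reward)
```

In the real pipeline the agent call would go through the proxy server so that log-probabilities are recorded, and the environment would consult the user simulator; both are replaced by plain functions here.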

Component Architecture (Natural Language to Code Entity)

The following diagram maps the conceptual simulation components to their implementation entities within the AReaL ecosystem.

Figure: Tau2 Training Architecture (diagram not included in this text version)

Sources: examples/tau2/README.md11-21 examples/tau2/config_8b_airline.yaml122-133 examples/openclaw/README.md55-57

Implementation Details

Configuration Dataclasses

The implementation relies on core dataclasses to manage the environment and RL parameters:

Tau2EnvConfig: Defines the domain (airline, retail, telecom), maximum steps, and user simulator endpoints (user_llm_base_url) (examples/tau2/README.md112-123).
Tau2PPOConfig: Extends standard PPO configurations to include Tau2-specific settings via the econfig field (examples/tau2/README.md19-21).
Tau2RunInfo: A structure that tracks metadata for each simulation run, including reward information and trajectory details (examples/tau2/README.md19-21).
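
A rough picture of how these dataclasses might nest is sketched below. The class names carry a `Sketch` suffix to make clear they are illustrative stand-ins, not the real definitions. The field names `user_llm_base_url`, `invalid_format_penalty`, `turn_discount`, `add_thinking_tool`, `solo_mode`, and `econfig` appear on this page; `max_steps`, `learning_rate`, every default value, and the assumption that `econfig` holds the environment config are hypothetical.

```python
from dataclasses import dataclass, field

VALID_DOMAINS = {"airline", "retail", "telecom"}


@dataclass
class Tau2EnvConfigSketch:
    """Illustrative stand-in for Tau2EnvConfig; defaults are assumptions."""
    domain: str = "airline"
    max_steps: int = 30  # hypothetical field name and default
    user_llm_base_url: str = "http://localhost:8000/v1"  # hypothetical default
    invalid_format_penalty: float = 0.1  # hypothetical default
    turn_discount: float = 1.0  # hypothetical default
    add_thinking_tool: bool = False
    solo_mode: bool = False

    def __post_init__(self) -> None:
        # Reject domains other than the three named in the documentation.
        if self.domain not in VALID_DOMAINS:
            raise ValueError(f"unknown domain: {self.domain!r}")


@dataclass
class Tau2PPOConfigSketch:
    """Illustrative stand-in for Tau2PPOConfig with a nested econfig field."""
    learning_rate: float = 1e-6  # hypothetical placeholder for the PPO settings
    econfig: Tau2EnvConfigSketch = field(default_factory=Tau2EnvConfigSketch)


config = Tau2PPOConfigSketch(econfig=Tau2EnvConfigSketch(domain="retail"))
print(config.econfig.domain, config.econfig.solo_mode)
```

Nesting the environment settings under one field keeps overrides such as `econfig.turn_discount` separate from the trainer's own options.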

Key Functions and Workflow

The training script train.py initializes the RL trainer and passes the Tau2AgentWorkflow to the training loop (examples/tau2/README.md13-15). It also loads the tau2 datasets and manages the training epoch cycle.

Figure: Inference and Optimization Flow (diagram not included in this text version)

Sources: examples/tau2/README.md13-18 examples/camel/train.py91-103

Advanced Configurations

Multi-Domain and MoE Support

AReaL supports scaling Tau2 training to large Mixture-of-Experts (MoE) models using the MegatronEngine or ArchonEngine backends with complex parallelism strategies (examples/tau2/README.md53-58).

| Model | Engine | Allocation Pattern (`alloc_mode`) | Scale | Source |
| --- | --- | --- | --- | --- |
| Qwen3-1.7B | Archon | `sglang:d6+archon:d2` | 1 Node | examples/tau2/config_1.7b_airline.yaml31-52 |
| Qwen3-8B | Archon | `sglang:d16+archon:d8` | 3 Nodes | examples/tau2/config_8b_airline.yaml31-52 |
| Qwen3-30B-A3B | Megatron | `sglang:d8t4+megatron:(attn:d4p4t2\|ffn:d2p4e4)` | 8 Nodes | examples/tau2/config_30b_moe_airline.yaml57 |
| Qwen3-235B-A22B | Megatron | `sglang:d4t8+megatron:(attn:d1p12t4c1\|ffn:d1p12t1e4)` | 10 Nodes | examples/tau2/config_235b_moe_airline.yaml58 |

Optimization Features

Tree Training: Enabled via enable_tree_training: true. This optimizes prefix computation for multi-turn interactions, significantly reducing redundant computation during the RL rollout phase (examples/tau2/config_8b_airline.yaml89 examples/tau2/README.md154-156).
Turn Discounting: Controlled by econfig.turn_discount. It lets the system weight earlier and later turns differently when the trajectory reward is attributed to individual turns (examples/tau2/config_8b_airline.yaml49 examples/camel/train.py101).
Thinking Tool: The agent can be configured to use an explicit "thinking" tool before responding by setting econfig.add_thinking_tool: true (examples/tau2/README.md116).
Solo Mode: If econfig.solo_mode is true, the agent handles both agent and user roles, removing the need for an external user simulator (examples/tau2/README.md117).
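
One common way to apply a turn discount is a geometric decay counted back from the final turn, so the last turn receives the full reward and earlier turns receive progressively less. The sketch below assumes that scheme; it is not confirmed to be the formula AReaL uses, and `discounted_turn_rewards` is a hypothetical helper name.

```python
from typing import List


def discounted_turn_rewards(
    final_reward: float, num_turns: int, turn_discount: float
) -> List[float]:
    """Assign final_reward * turn_discount**k to the turn k steps before the end."""
    return [
        final_reward * turn_discount ** (num_turns - 1 - turn)
        for turn in range(num_turns)
    ]


# Three agent turns, final reward 1.0, discount 0.5.
print(discounted_turn_rewards(1.0, 3, 0.5))  # [0.25, 0.5, 1.0]
```

With `turn_discount` set to 1.0 every turn receives the same reward, which corresponds to no discounting under this assumed scheme.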

Execution and Environment

Prerequisites

Training requires a specific fork of tau2-bench that supports async completion and custom user simulators (examples/tau2/README.md30-38).

Resource Allocation

The backend strings in the configuration define how GPUs are partitioned between the inference engine (sglang) and the training engine (archon or megatron) (examples/tau2/config_8b_airline.yaml31-52). For example, sglang:d16+archon:d8 assigns 16 GPUs to SGLang for rollouts and 8 GPUs to Archon for training the actor, spread across the cluster nodes.
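
The GPU counts implied by these strings can be checked with a few lines of arithmetic. The sketch below is not AReaL's parser. It assumes that each letter-number pair (such as d, t, p, c, e) is a parallelism degree, that a backend's GPU count is the product of those degrees, that the attn and ffn sections of a Megatron spec describe the same set of GPUs, and that each node has 8 GPUs. The function names are hypothetical.

```python
import math
import re
from typing import Dict

GPUS_PER_NODE = 8  # assumption, not stated in the source


def section_gpus(spec: str) -> int:
    """Multiply every degree in a spec such as 'd4p4t2' (4 * 4 * 2 = 32)."""
    return math.prod(int(n) for n in re.findall(r"[a-z](\d+)", spec))


def gpus_per_backend(alloc_mode: str) -> Dict[str, int]:
    """Map each backend in an alloc_mode string to its assumed GPU count."""
    counts = {}
    for component in alloc_mode.split("+"):
        backend, spec = component.split(":", 1)
        if spec.startswith("("):
            # Hybrid spec such as '(attn:d4p4t2|ffn:d2p4e4)'.
            sections = spec.strip("()").split("|")
            sizes = {section_gpus(s.split(":", 1)[1]) for s in sections}
            if len(sizes) != 1:
                raise ValueError(f"sections disagree on GPU count: {spec}")
            counts[backend] = sizes.pop()
        else:
            counts[backend] = section_gpus(spec)
    return counts


def nodes_needed(alloc_mode: str) -> int:
    """Round the total GPU count up to whole nodes."""
    total = sum(gpus_per_backend(alloc_mode).values())
    return math.ceil(total / GPUS_PER_NODE)


print(gpus_per_backend("sglang:d16+archon:d8"))  # {'sglang': 16, 'archon': 8}
print(nodes_needed("sglang:d16+archon:d8"))  # 3
```

Under these assumptions the four example configurations come out to 1, 3, 8, and 10 nodes, which matches the Scale column listed for them.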

Sources: examples/tau2/README.md1-156 examples/tau2/config_8b_airline.yaml1-134 examples/camel/train.py1-136


URL: https://deepwiki.com/inclusionAI/AReaL/14.8-customer-service-agents-(tau2)