Last indexed: 7 May 2026 (2e12c1)

Agentic RL Overview

Purpose and Scope

This page provides a high-level introduction to agentic reinforcement learning in AReaL: what it is, why it's useful, and how AReaL's architecture supports it. Agentic RL enables training language models that interact with tools, environments, or engage in multi-turn conversations by treating each interaction as part of a longer episode.

AReaL's design philosophy for agentic RL centers on unified training and deployment, allowing users to leverage the same agent orchestration code for both training and evaluation without modification. AReaL provides a specialized infrastructure for capturing trajectories from external runtimes like OpenClaw or ZeroClaw using an OpenAI-compatible proxy gateway examples/openclaw/README.md1-6

For implementation details, see:

OpenAI-compatible client implementation: 6.2 ArealOpenAI Client
Conversation tracking mechanism: 6.3 InteractionCache and Session Tracking
Multi-turn conversation handling: 6.4 Multi-turn Conversations
Reward assignment and discounting: 6.5 Reward Assignment and Discounting
Deployment as a service: 6.6 Proxy Server Architecture

Sources: examples/openclaw/README.md1-15 areal/experimental/openai/client.py12-71

What is Agentic RL?

Agentic RL refers to reinforcement learning for training agents that:

Engage in multi-turn conversations where each turn builds on previous context areal/experimental/workflow/multi_turn_v2.py17-33
Use tools (function calling, code execution, calculator) to accomplish tasks areal/experimental/openai/tool_call_parser.py61-72
Interact with environments (terminals, browsers, external runtimes) over multiple steps examples/openclaw/README.md165-177
Maintain state across interactions within an episode using session tracking areal/experimental/openai/proxy/server.py66-78

Traditional single-turn RL treats each prompt-response pair independently. In contrast, agentic RL treats a sequence of interactions as a single episode, where earlier actions affect later states and rewards may be sparse and assigned at the end of the session areal/experimental/openai/cache.py55-84

Key Characteristics

Aspect	Traditional RL	Agentic RL
Episode Structure	Single prompt → response	Multi-turn conversation tree areal/experimental/openai/cache.py146-160
Reward Signal	Per-response reward	Sparse terminal rewards with backward propagation areal/experimental/openai/cache.py55-84
State	Independent prompts	Conversation history via `InteractionCache` areal/experimental/openai/cache.py13-18
Use Cases	Math problems, Q&A	Tool-use, Customer service, Search agents examples/tau2/config_8b_airline.yaml121-134

Sources: areal/experimental/openai/cache.py13-160 areal/experimental/workflow/multi_turn_v2.py17-33 examples/tau2/config_8b_airline.yaml121-134

AReaL's Approach to Agentic RL

AReaL supports three primary execution modes for agent workflows: inline, subproc, and online examples/openclaw/config.yaml34-40 In online mode, AReaL exposes a service that allows external agents to interact with the training loop via a standard API README.md31-34

High-Level Architecture

In online mode, AReaL provides a Proxy Gateway that acts as a drop-in replacement for OpenAI's API. This allows external runtimes like ZeroClaw or OpenClaw to interact with the model under training as if it were a standard production endpoint examples/openclaw/README.md3-6 README.md61-63

Diagram 1: Agentic RL Architecture - External applications connect to the Proxy Gateway which routes to workers to collect token-level data and store them in InteractionCache areal/experimental/openai/proxy/server.py66-78 areal/experimental/openai/cache.py13

Sources: areal/experimental/openai/proxy/server.py66-78 areal/experimental/openai/cache.py13 examples/openclaw/README.md1-6 README.md31-34

Key Components

Proxy Gateway & Worker

The gateway manages session lifecycles and authentication. It allows starting sessions via /rl/start_session and setting rewards via /rl/set_reward areal/experimental/openai/proxy/server.py179-181 The SessionData class tracks individual session status and access times areal/experimental/openai/proxy/server.py66-78

Interaction Tracking & Session Management

AReaL allocates a unique Session API Key for each trajectory areal/experimental/openai/proxy/server.py33-38 This key allows AReaL to differentiate trajectories from concurrent agent applications examples/openclaw/README.md112-124 The InteractionCache class handles the building of parent-child relationships between turns by matching prefixes of message histories areal/experimental/openai/cache.py112-160

Reward Assignment

Rewards are typically assigned at the end of an episode. AReaL supports assigning rewards to the last interaction or a specific interaction_id areal/experimental/openai/cache.py44-53 The apply_reward_discount method propagates rewards backward through the conversation tree using a geometric discount factor areal/experimental/openai/cache.py55-84

Advanced Workflow Scaffolding

AReaL supports modular agent execution via the Scaffolding framework, which decouples agent execution from reward calculation README.md43-51 This is particularly useful for complex environments like Terminal Bench README.md76-80 or customer service benchmarks like Tau2-Bench examples/tau2/README.md5-9

Sources: areal/experimental/openai/cache.py44-160 areal/experimental/openai/proxy/server.py33-181 examples/openclaw/README.md112-124 README.md43-80 examples/tau2/README.md5-9

Integration Workflow

Code Entity Mapping

Diagram 2: Code Entity Flow - Maps external API usage to internal infrastructure classes like InteractionCache areal/experimental/openai/cache.py13 and the underlying data type InteractionWithTokenLogpReward areal/experimental/openai/types.py36

Sources: areal/experimental/openai/cache.py13 areal/experimental/openai/types.py36 areal/experimental/openai/proxy/server.py179-181

Relationship to Core AReaL System

Agentic RL utilizes the same distributed backends as standard RL. The InteractionWithTokenLogpReward object is responsible for converting captured conversation turns into tensor_dict formats (including input_ids, logprobs, and loss_mask) that training engines like FSDP or Archon can process areal/experimental/openai/types.py143-195

Supported Algorithms

Agentic data can be optimized using:

PPO: Standard actor-critic for multi-turn episodes examples/tau2/config_8b_airline.yaml77-80
GRPO: Group-based optimization, supported by exporting interactions in specific styles examples/tau2/config_235b_moe_airline.yaml1-2
Tree Training: Prefix-sharing optimization for multi-turn sequences, supported in Archon and FSDP examples/tau2/config_8b_airline.yaml89 examples/tau2/README.md154-156

Sources: areal/experimental/openai/types.py143-195 examples/tau2/config_8b_airline.yaml77-89 examples/tau2/README.md154-156

Supported Features

Feature	Status	Implementation Detail
OpenAI Compatibility	✅	Wraps `AsyncOpenAI` for reward and logprob tracking areal/experimental/openai/client.py12-66
Tool Calling	✅	Integrated `process_tool_calls` via SGLang or vLLM parsers areal/experimental/openai/tool_call_parser.py61-206
Multi-turn Conversations	✅	Automatic tracking via prefix matching in `InteractionCache` areal/experimental/openai/cache.py112-160
Online Mode	✅	Proxy gateway for external runtimes like ZeroClaw examples/openclaw/README.md42-50
Reward Assignment	✅	`set_reward` and `set_last_reward` methods areal/experimental/openai/cache.py44-53
Session Tracking	✅	`SessionData` handles timeout and access lifecycle areal/experimental/openai/proxy/server.py66-89
Tree Training	✅	Optimized prefix sharing for multi-turn reasoning examples/tau2/README.md154-156

Sources: areal/experimental/openai/client.py12-66 areal/experimental/openai/cache.py44-160 areal/experimental/openai/proxy/server.py66-89 areal/experimental/openai/tool_call_parser.py61-206 examples/tau2/README.md154-156

Refresh this wiki

URL: https://deepwiki.com/inclusionAI/AReaL/6.1-agentic-rl-overview