VOOZH about

URL: https://deepwiki.com/inclusionAI/AReaL/6.1-agentic-rl-overview

⇱ Agentic RL Overview | inclusionAI/AReaL | DeepWiki


Loading...
Last indexed: 7 May 2026 (2e12c1)
Menu

Agentic RL Overview

Purpose and Scope

This page provides a high-level introduction to agentic reinforcement learning in AReaL: what it is, why it's useful, and how AReaL's architecture supports it. Agentic RL enables training language models that interact with tools, environments, or engage in multi-turn conversations by treating each interaction as part of a longer episode.

AReaL's design philosophy for agentic RL centers on unified training and deployment, allowing users to leverage the same agent orchestration code for both training and evaluation without modification. AReaL provides a specialized infrastructure for capturing trajectories from external runtimes like OpenClaw or ZeroClaw using an OpenAI-compatible proxy gateway examples/openclaw/README.md1-6

For implementation details, see:

Sources: examples/openclaw/README.md1-15 areal/experimental/openai/client.py12-71


What is Agentic RL?

Agentic RL refers to reinforcement learning for training agents that:

  1. Engage in multi-turn conversations where each turn builds on previous context areal/experimental/workflow/multi_turn_v2.py17-33
  2. Use tools (function calling, code execution, calculator) to accomplish tasks areal/experimental/openai/tool_call_parser.py61-72
  3. Interact with environments (terminals, browsers, external runtimes) over multiple steps examples/openclaw/README.md165-177
  4. Maintain state across interactions within an episode using session tracking areal/experimental/openai/proxy/server.py66-78

Traditional single-turn RL treats each prompt-response pair independently. In contrast, agentic RL treats a sequence of interactions as a single episode, where earlier actions affect later states and rewards may be sparse and assigned at the end of the session areal/experimental/openai/cache.py55-84

Key Characteristics

AspectTraditional RLAgentic RL
Episode StructureSingle prompt → responseMulti-turn conversation tree areal/experimental/openai/cache.py146-160
Reward SignalPer-response rewardSparse terminal rewards with backward propagation areal/experimental/openai/cache.py55-84
StateIndependent promptsConversation history via InteractionCache areal/experimental/openai/cache.py13-18
Use CasesMath problems, Q&ATool-use, Customer service, Search agents examples/tau2/config_8b_airline.yaml121-134

Sources: areal/experimental/openai/cache.py13-160 areal/experimental/workflow/multi_turn_v2.py17-33 examples/tau2/config_8b_airline.yaml121-134


AReaL's Approach to Agentic RL

AReaL supports three primary execution modes for agent workflows: inline, subproc, and online examples/openclaw/config.yaml34-40 In online mode, AReaL exposes a service that allows external agents to interact with the training loop via a standard API README.md31-34

High-Level Architecture

In online mode, AReaL provides a Proxy Gateway that acts as a drop-in replacement for OpenAI's API. This allows external runtimes like ZeroClaw or OpenClaw to interact with the model under training as if it were a standard production endpoint examples/openclaw/README.md3-6 README.md61-63


Diagram 1: Agentic RL Architecture - External applications connect to the Proxy Gateway which routes to workers to collect token-level data and store them in InteractionCache areal/experimental/openai/proxy/server.py66-78 areal/experimental/openai/cache.py13

Sources: areal/experimental/openai/proxy/server.py66-78 areal/experimental/openai/cache.py13 examples/openclaw/README.md1-6 README.md31-34


Key Components

Proxy Gateway & Worker

The gateway manages session lifecycles and authentication. It allows starting sessions via /rl/start_session and setting rewards via /rl/set_reward areal/experimental/openai/proxy/server.py179-181 The SessionData class tracks individual session status and access times areal/experimental/openai/proxy/server.py66-78

Interaction Tracking & Session Management

AReaL allocates a unique Session API Key for each trajectory areal/experimental/openai/proxy/server.py33-38 This key allows AReaL to differentiate trajectories from concurrent agent applications examples/openclaw/README.md112-124 The InteractionCache class handles the building of parent-child relationships between turns by matching prefixes of message histories areal/experimental/openai/cache.py112-160

Reward Assignment

Rewards are typically assigned at the end of an episode. AReaL supports assigning rewards to the last interaction or a specific interaction_id areal/experimental/openai/cache.py44-53 The apply_reward_discount method propagates rewards backward through the conversation tree using a geometric discount factor areal/experimental/openai/cache.py55-84

Advanced Workflow Scaffolding

AReaL supports modular agent execution via the Scaffolding framework, which decouples agent execution from reward calculation README.md43-51 This is particularly useful for complex environments like Terminal Bench README.md76-80 or customer service benchmarks like Tau2-Bench examples/tau2/README.md5-9

Sources: areal/experimental/openai/cache.py44-160 areal/experimental/openai/proxy/server.py33-181 examples/openclaw/README.md112-124 README.md43-80 examples/tau2/README.md5-9


Integration Workflow

Code Entity Mapping


Diagram 2: Code Entity Flow - Maps external API usage to internal infrastructure classes like InteractionCache areal/experimental/openai/cache.py13 and the underlying data type InteractionWithTokenLogpReward areal/experimental/openai/types.py36

Sources: areal/experimental/openai/cache.py13 areal/experimental/openai/types.py36 areal/experimental/openai/proxy/server.py179-181


Relationship to Core AReaL System

Agentic RL utilizes the same distributed backends as standard RL. The InteractionWithTokenLogpReward object is responsible for converting captured conversation turns into tensor_dict formats (including input_ids, logprobs, and loss_mask) that training engines like FSDP or Archon can process areal/experimental/openai/types.py143-195

Supported Algorithms

Agentic data can be optimized using:

Sources: areal/experimental/openai/types.py143-195 examples/tau2/config_8b_airline.yaml77-89 examples/tau2/README.md154-156


Supported Features

FeatureStatusImplementation Detail
OpenAI CompatibilityWraps AsyncOpenAI for reward and logprob tracking areal/experimental/openai/client.py12-66
Tool CallingIntegrated process_tool_calls via SGLang or vLLM parsers areal/experimental/openai/tool_call_parser.py61-206
Multi-turn ConversationsAutomatic tracking via prefix matching in InteractionCache areal/experimental/openai/cache.py112-160
Online ModeProxy gateway for external runtimes like ZeroClaw examples/openclaw/README.md42-50
Reward Assignmentset_reward and set_last_reward methods areal/experimental/openai/cache.py44-53
Session TrackingSessionData handles timeout and access lifecycle areal/experimental/openai/proxy/server.py66-89
Tree TrainingOptimized prefix sharing for multi-turn reasoning examples/tau2/README.md154-156

Sources: areal/experimental/openai/client.py12-66 areal/experimental/openai/cache.py44-160 areal/experimental/openai/proxy/server.py66-89 areal/experimental/openai/tool_call_parser.py61-206 examples/tau2/README.md154-156