VOOZH about

URL: https://deepwiki.com/inclusionAI/AReaL/6-agentic-rl-integration

⇱ Agentic RL Integration | inclusionAI/AReaL | DeepWiki


Loading...
Last indexed: 7 May 2026 (2e12c1)
Menu

Agentic RL Integration

AReaL's agentic RL system enables training of agent frameworks (OpenAI Agents SDK, CAMEL-AI, Claude SDK, LangChain) using reinforcement learning, while maintaining full token-level tracking and reward propagation. The system bridges the gap between agent frameworks designed for inference and the requirements of RL training.

This page provides an overview of the agentic RL infrastructure. For detailed information on specific components:

The Problem: Agent Frameworks Meet RL Training

Agent frameworks are designed for inference and lack three critical features for RL training:

  1. No token-level access: Frameworks use high-level APIs (e.g., OpenAI's chat completion API) that do not expose token IDs and log probabilities required for computing policy gradients. areal/experimental/openai/client.py65-67
  2. No reward mechanism: Frameworks have no built-in reward functions. RL training requires reward signals assigned to specific model outputs. areal/experimental/openai/cache.py44-53
  3. Limited parallelization: Standard agent usage involves sequential execution, making it difficult to efficiently collect diverse trajectories needed for RL training. areal/experimental/openai/proxy/proxy_rollout_server.py98-102

Sources: areal/experimental/openai/client.py65-67 areal/experimental/openai/cache.py44-53 areal/experimental/openai/proxy/proxy_rollout_server.py98-102

AReaL's Solution: Transparent Tracking Layer

AReaL solves these problems by intercepting LLM API calls and maintaining a complete interaction history through a specialized tracking layer.

System Architecture Diagram:


Sources: areal/experimental/openai/client.py65-67 areal/experimental/openai/cache.py13-112 areal/experimental/openai/proxy/server.py66-78 areal/experimental/openai/types.py143-194

Integration Paradigms

AReaL supports execution modes for agent workflows through several approaches:

Inline/Subprocess Integration (Recommended)

The agent runs within the AReaL rollout worker environment. The MultiTurnWorkflow uses ArealOpenAI directly to manage multi-turn episodes, including reflection messages and turn discounting. areal/experimental/workflow/multi_turn_v2.py17-43


Sources: areal/experimental/workflow/multi_turn_v2.py44-96 areal/experimental/openai/client.py73-75

Online Mode (External)

External applications interact with AReaL via a Proxy Gateway. The system manages sessions via StartSessionRequest and tracks interaction history in SessionData. areal/experimental/openai/proxy/server.py26-30 areal/experimental/openai/proxy/server.py66-78

Sources: areal/experimental/openai/proxy/server.py26-30 areal/experimental/openai/proxy/server.py66-78

Core Components

The agentic RL system consists of several specialized components that bridge the gap between "Natural Language Space" (APIs) and "Code Entity Space" (Training Tensors).

Component Interaction Flow


Sources: areal/experimental/openai/client.py54-58 areal/experimental/openai/cache.py13-41 areal/experimental/openai/types.py35-58

1. ArealOpenAI Client

Extends the standard AsyncOpenAI client to capture ModelResponse data, including token IDs and log probabilities. It handles the mapping between OpenAI's chat format and the underlying inference engine requests. areal/experimental/openai/client.py65-76

Key class: ArealOpenAI in areal/experimental/openai/client.py areal/experimental/openai/client.py47

2. InteractionCache

A specialized OrderedDict that stores InteractionWithTokenLogpReward objects. It automatically constructs parent-child relationships by comparing message history prefixes, enabling the reconstruction of conversation trees for multi-turn RL. areal/experimental/openai/cache.py13-112

Key class: InteractionCache in areal/experimental/openai/cache.py areal/experimental/openai/cache.py13

3. Reward Assignment and Discounting

Rewards can be assigned to specific interactions using set_reward or the most recent completion via set_last_reward. The cache supports backward propagation of rewards through the conversation tree using a turn_discount factor. areal/experimental/openai/cache.py44-54 areal/experimental/openai/cache.py55-84

Key methods: set_reward(), apply_reward_discount() in areal/experimental/openai/cache.py areal/experimental/openai/cache.py44 areal/experimental/openai/cache.py55

4. Tool Call Integration

The system includes parsers to extract tool calls from raw model text and convert them into structured tool call objects compatible with agent frameworks. It supports both SGLang and vLLM parser logic. areal/experimental/openai/tool_call_parser.py61-156

Key function: process_tool_calls() in areal/experimental/openai/tool_call_parser.py areal/experimental/openai/tool_call_parser.py61

Data Flow: From API Call to Training Data

The following diagram illustrates how a standard OpenAI API call is transformed into the rich data required for RL training.


Sources: areal/experimental/openai/client.py65-67 areal/experimental/openai/types.py143-194 areal/experimental/openai/cache.py112-162

Online Proxy and Authentication

In online mode, AReaL provides a two-tier authentication system:

EndpointAuth LevelPurpose
/rl/start_sessionAdminInitiates a new training session and issues session key areal/experimental/openai/proxy/server.py179
/chat/completionsSessionStandard inference with data collection areal/experimental/openai/proxy/server.py182
/rl/set_rewardSessionAssigns reward to a specific interaction areal/experimental/openai/proxy/server.py181
/export_trajectoriesSessionFinalizes the session and triggers data export areal/experimental/openai/proxy/server.py186

Sources: areal/experimental/openai/proxy/server.py179-187 areal/experimental/openai/proxy/server.py26-37