Last indexed: 7 May 2026 (2e12c1)

Agentic RL Integration

AReaL's agentic RL system enables training of agent frameworks (OpenAI Agents SDK, CAMEL-AI, Claude SDK, LangChain) using reinforcement learning, while maintaining full token-level tracking and reward propagation. The system bridges the gap between agent frameworks designed for inference and the requirements of RL training.

This page provides an overview of the agentic RL infrastructure. For detailed information on specific components:

Agentic RL Overview — Motivation and design philosophy
ArealOpenAI Client — Extended OpenAI client with reward tracking
InteractionCache and Session Tracking — Interaction storage and management
Multi-turn Conversations — Building conversation trees with token alignment
Reward Assignment and Discounting — Reward propagation algorithms
Proxy Server Architecture — HTTP proxy design for external agents
Tool Call Integration — Tool calling and parsing support
Interaction Export — Converting interactions to training format
Agent Service Architecture — Experimental microservice-based agent serving system

The Problem: Agent Frameworks Meet RL Training

Agent frameworks are designed for inference and lack three critical features for RL training:

No token-level access: Frameworks use high-level APIs (e.g., OpenAI's chat completion API) that do not expose token IDs and log probabilities required for computing policy gradients. areal/experimental/openai/client.py65-67
No reward mechanism: Frameworks have no built-in reward functions. RL training requires reward signals assigned to specific model outputs. areal/experimental/openai/cache.py44-53
Limited parallelization: Standard agent usage involves sequential execution, making it difficult to efficiently collect diverse trajectories needed for RL training. areal/experimental/openai/proxy/proxy_rollout_server.py98-102

Sources: areal/experimental/openai/client.py65-67 areal/experimental/openai/cache.py44-53 areal/experimental/openai/proxy/proxy_rollout_server.py98-102

AReaL's Solution: Transparent Tracking Layer

AReaL solves these problems by intercepting LLM API calls and maintaining a complete interaction history through a specialized tracking layer.

System Architecture Diagram:

Sources: areal/experimental/openai/client.py65-67 areal/experimental/openai/cache.py13-112 areal/experimental/openai/proxy/server.py66-78 areal/experimental/openai/types.py143-194

Integration Paradigms

AReaL supports execution modes for agent workflows through several approaches:

Inline/Subprocess Integration (Recommended)

The agent runs within the AReaL rollout worker environment. The MultiTurnWorkflow uses ArealOpenAI directly to manage multi-turn episodes, including reflection messages and turn discounting. areal/experimental/workflow/multi_turn_v2.py17-43

Sources: areal/experimental/workflow/multi_turn_v2.py44-96 areal/experimental/openai/client.py73-75

Online Mode (External)

External applications interact with AReaL via a Proxy Gateway. The system manages sessions via StartSessionRequest and tracks interaction history in SessionData. areal/experimental/openai/proxy/server.py26-30 areal/experimental/openai/proxy/server.py66-78

Sources: areal/experimental/openai/proxy/server.py26-30 areal/experimental/openai/proxy/server.py66-78

Core Components

The agentic RL system consists of several specialized components that bridge the gap between "Natural Language Space" (APIs) and "Code Entity Space" (Training Tensors).

Component Interaction Flow

Sources: areal/experimental/openai/client.py54-58 areal/experimental/openai/cache.py13-41 areal/experimental/openai/types.py35-58

1. ArealOpenAI Client

Extends the standard AsyncOpenAI client to capture ModelResponse data, including token IDs and log probabilities. It handles the mapping between OpenAI's chat format and the underlying inference engine requests. areal/experimental/openai/client.py65-76

Key class: ArealOpenAI in areal/experimental/openai/client.py areal/experimental/openai/client.py47

2. InteractionCache

A specialized OrderedDict that stores InteractionWithTokenLogpReward objects. It automatically constructs parent-child relationships by comparing message history prefixes, enabling the reconstruction of conversation trees for multi-turn RL. areal/experimental/openai/cache.py13-112

Key class: InteractionCache in areal/experimental/openai/cache.py areal/experimental/openai/cache.py13

3. Reward Assignment and Discounting

Rewards can be assigned to specific interactions using set_reward or the most recent completion via set_last_reward. The cache supports backward propagation of rewards through the conversation tree using a turn_discount factor. areal/experimental/openai/cache.py44-54 areal/experimental/openai/cache.py55-84

Key methods: set_reward(), apply_reward_discount() in areal/experimental/openai/cache.py areal/experimental/openai/cache.py44 areal/experimental/openai/cache.py55

4. Tool Call Integration

The system includes parsers to extract tool calls from raw model text and convert them into structured tool call objects compatible with agent frameworks. It supports both SGLang and vLLM parser logic. areal/experimental/openai/tool_call_parser.py61-156

Key function: process_tool_calls() in areal/experimental/openai/tool_call_parser.py areal/experimental/openai/tool_call_parser.py61

Data Flow: From API Call to Training Data

The following diagram illustrates how a standard OpenAI API call is transformed into the rich data required for RL training.

Sources: areal/experimental/openai/client.py65-67 areal/experimental/openai/types.py143-194 areal/experimental/openai/cache.py112-162

Online Proxy and Authentication

In online mode, AReaL provides a two-tier authentication system:

Admin API Key: Configured via DEFAULT_ADMIN_API_KEY, used for management tasks like starting sessions. areal/experimental/openai/proxy/server.py191
Session API Key: Issued per session via StartSessionResponse, used for specific agent trajectories to ensure data isolation. areal/experimental/openai/proxy/server.py33-37

Endpoint	Auth Level	Purpose
`/rl/start_session`	Admin	Initiates a new training session and issues session key areal/experimental/openai/proxy/server.py179
`/chat/completions`	Session	Standard inference with data collection areal/experimental/openai/proxy/server.py182
`/rl/set_reward`	Session	Assigns reward to a specific interaction areal/experimental/openai/proxy/server.py181
`/export_trajectories`	Session	Finalizes the session and triggers data export areal/experimental/openai/proxy/server.py186

Sources: areal/experimental/openai/proxy/server.py179-187 areal/experimental/openai/proxy/server.py26-37

Refresh this wiki

URL: https://deepwiki.com/inclusionAI/AReaL/6-agentic-rl-integration