Last indexed: 7 May 2026 (2e12c1)

Reward Assignment and Discounting

This page describes how rewards are assigned to individual interactions (completions or responses) in AReaL's OpenAI-compatible API and how rewards are propagated backward through multi-turn conversations using temporal discounting. This is a core mechanism for converting scalar feedback signals into training data suitable for reinforcement learning.

For information about the overall agentic RL system and OpenAI client architecture, see Agentic RL Overview and ArealOpenAI Client. For details on how interactions are tracked and cached, see InteractionCache and Session Tracking. For exporting interactions as training data, see Interaction Export.

Purpose and Scope

When using the ArealOpenAI client to build agentic workflows, external agents generate multi-turn conversations that result in some final outcome (e.g., task success or failure). This page covers:

Direct Reward Assignment: Setting reward values on individual interactions by ID or position.
Backward Reward Discounting: Propagating rewards backward through conversation histories with geometric discounting.
Thread Safety: Managing concurrent reward updates in the interaction cache.
Serialization for Distribution: How rewards and interaction data are prepared for transport in the proxy server architecture.
Asynchronous Reward Functions: Utilities for wrapping synchronous reward logic for efficient execution.

The reward assignment system is designed to support both immediate feedback (setting rewards as they occur) and delayed feedback (setting a terminal reward and propagating it backward).

Key Classes and Components

The reward assignment system involves several primary classes and utilities:

Reward Management Architecture

This diagram bridges the conceptual reward flow with the specific classes in the areal/experimental/openai package.

Sources: areal/experimental/openai/client.py1035-1163 areal/experimental/openai/cache.py13-160 areal/experimental/openai/types.py36-58 areal/experimental/openai/proxy/server.py66-121

Setting Rewards on Interactions

Rewards can be assigned to interactions in the InteractionCache areal/experimental/openai/cache.py13-18 via the ArealOpenAI client areal/experimental/openai/client.py1088-1097:

Setting Reward by Interaction ID

Each completion or response is assigned a unique ID. You can set the reward for a specific interaction using this ID:

The set_reward() method in InteractionCache updates the reward for the specific interaction and maintains a running total _total_reward areal/experimental/openai/cache.py44-49

Setting Reward on Last Interaction

For workflows where you process interactions sequentially, you can set the reward on the most recently created interaction using set_last_reward() areal/experimental/openai/cache.py51-53 Internally, this identifies the last key in the OrderedDict using the last_interaction_id property areal/experimental/openai/cache.py37-38

Backward Reward Discounting

In multi-turn conversations, a terminal reward is often propagated backward through the conversation history using the apply_reward_discount() method areal/experimental/openai/cache.py55-105

Algorithm

The discounting algorithm processes interactions in reverse creation order (most recent first):

Start with the most recent interaction.
If it has no explicit reward, set it to 0.0 (with a warning) areal/experimental/openai/cache.py94-101
For each earlier interaction i:
- Initialize its reward to 0.0 if unset.
- Compute: reward[i] = current_reward * turn_discount + interaction.reward areal/experimental/openai/cache.py103-104

Implementation Flow

Sources: areal/experimental/openai/cache.py55-105

Asynchronous Reward Execution

To prevent reward computation from blocking the main event loop in agentic workflows, AReaL provides the AsyncRewardWrapper areal/api/reward_api.py63-68

Features

Process Isolation: Executes reward functions in a ProcessPoolExecutor to avoid GIL contention and ensure stability areal/api/reward_api.py95-97
Automatic Recovery: Automatically recreates the process pool if it becomes broken during execution areal/api/reward_api.py119-136
Timeout Handling: Supports a configurable timeout_seconds (default 15s), returning a reward of 0 if exceeded areal/api/reward_api.py159-163
Resource Management: Uses weakref.finalize to shut down executors when the wrapper instance is garbage collected areal/api/reward_api.py101-116

Multi-Turn Workflow Integration

The MultiTurnWorkflow uses this wrapper to evaluate reasoning steps. If a reward is 0, it applies a turn_discount to the terminal reward based on the number of turns taken areal/experimental/workflow/multi_turn_v2.py86-88

Sources: areal/api/reward_api.py63-184 areal/experimental/workflow/multi_turn_v2.py17-96

Reward Handling in Proxy Server

When AReaL runs as an OpenAI-compatible proxy, rewards are managed within the SessionData object areal/experimental/openai/proxy/server.py66-78

Trajectory Export and Discounting

When the proxy server receives an export_trajectories request, it triggers the discounting logic on the session's cache before returning the data to the trainer areal/experimental/openai/proxy/server.py115-121

Serialization

Because rewards and interactions must be sent over HTTP, the system uses specialized serialization:

serialize_interactions: Converts InteractionWithTokenLogpReward objects into JSON-compatible dictionaries. It handles both raw message lists and tensor-based data (like logprobs and versions) areal/experimental/openai/proxy/server.py129-150
deserialize_interactions: Reconstructs the objects on the receiving end, populating the internal _cache with tensor data if present areal/experimental/openai/proxy/server.py153-172

Thread Safety and State Management

The InteractionCache uses a threading.Lock to ensure safe concurrent reward updates areal/experimental/openai/cache.py18-46

Key Invariants

Single Discount Application: apply_reward_discount() can only be called once per cache instance (enforced by _apply_reward_discount_called flag) areal/experimental/openai/cache.py86-88
Total Reward Consistency: _total_reward is updated atomically when rewards change by subtracting the old reward and adding the new one areal/experimental/openai/cache.py47-49

Code Entity Reference

Entity	Location	Purpose
`ArealOpenAI.set_reward()`	areal/experimental/openai/client.py1088-1092	Public API for setting reward by ID.
`ArealOpenAI.apply_reward_discount()`	areal/experimental/openai/client.py1100-1128	Public API for backward reward propagation.
`InteractionCache.set_reward()`	areal/experimental/openai/cache.py44-49	Thread-safe reward update with total tracking.
`InteractionCache.apply_reward_discount()`	areal/experimental/openai/cache.py55-105	Core discounting algorithm implementation.
`AsyncRewardWrapper`	areal/api/reward_api.py63-68	Utility for running reward logic in isolated processes.
`InteractionWithTokenLogpReward`	areal/experimental/openai/types.py36-58	Dataclass holding the reward and interaction metadata.
`SessionData.export_interactions()`	areal/experimental/openai/proxy/server.py115-121	Triggers discounting before data export in proxy mode.
`serialize_interactions`	areal/experimental/openai/proxy/server.py129-150	Prepares rewarded interactions for HTTP transport.

Sources: areal/experimental/openai/client.py1035-1163 areal/experimental/openai/cache.py13-160 areal/experimental/openai/types.py36-195 areal/experimental/openai/proxy/server.py115-150 areal/api/reward_api.py63-184

Refresh this wiki

URL: https://deepwiki.com/inclusionAI/AReaL/6.5-reward-assignment-and-discounting