
URL: https://deepwiki.com/inclusionAI/AReaL/6.5-reward-assignment-and-discounting

Last indexed: 7 May 2026 (2e12c1)

Reward Assignment and Discounting

This page describes how rewards are assigned to individual interactions (completions or responses) in AReaL's OpenAI-compatible API and how rewards are propagated backward through multi-turn conversations using temporal discounting. This is a core mechanism for converting scalar feedback signals into training data suitable for reinforcement learning.

For information about the overall agentic RL system and OpenAI client architecture, see Agentic RL Overview and ArealOpenAI Client. For details on how interactions are tracked and cached, see InteractionCache and Session Tracking. For exporting interactions as training data, see Interaction Export.


Purpose and Scope

When using the ArealOpenAI client to build agentic workflows, external agents generate multi-turn conversations that result in some final outcome (e.g., task success or failure). This page covers:

  1. Direct Reward Assignment: Setting reward values on individual interactions by ID or position.
  2. Backward Reward Discounting: Propagating rewards backward through conversation histories with geometric discounting.
  3. Thread Safety: Managing concurrent reward updates in the interaction cache.
  4. Serialization for Distribution: How rewards and interaction data are prepared for transport in the proxy server architecture.
  5. Asynchronous Reward Functions: Utilities for wrapping synchronous reward logic for efficient execution.

The reward assignment system is designed to support both immediate feedback (setting rewards as they occur) and delayed feedback (setting a terminal reward and propagating it backward).


Key Classes and Components

The reward assignment system involves several primary classes and utilities, which are listed with their source locations in the Code Entity Reference at the end of this page.

Reward Management Architecture

[Diagram not reproduced: it maps the conceptual reward flow to the specific classes in the areal/experimental/openai package.]


Sources: areal/experimental/openai/client.py:1035-1163, areal/experimental/openai/cache.py:13-160, areal/experimental/openai/types.py:36-58, areal/experimental/openai/proxy/server.py:66-121


Setting Rewards on Interactions

Rewards can be assigned to interactions in the InteractionCache (areal/experimental/openai/cache.py:13-18) via the ArealOpenAI client (areal/experimental/openai/client.py:1088-1097), either by interaction ID or on the most recent interaction.

Setting Reward by Interaction ID

Each completion or response is assigned a unique ID. Passing this ID to set_reward() sets the reward for that specific interaction.


The set_reward() method in InteractionCache updates the reward for the specific interaction and maintains a running total, _total_reward (areal/experimental/openai/cache.py:44-49).
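To make the running-total bookkeeping concrete, here is a minimal sketch, assuming the cache is an OrderedDict keyed by interaction ID and guarded by a lock. ToyInteraction and ToyInteractionCache are hypothetical stand-ins written for illustration; they are not AReaL's classes, and the real set_reward() may differ in detail.

```python
import threading
from collections import OrderedDict
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyInteraction:
    """Hypothetical stand-in for an interaction record; only the reward matters here."""
    reward: Optional[float] = None


class ToyInteractionCache(OrderedDict):
    """Illustrative cache (an assumption, not AReaL's InteractionCache)."""

    def __init__(self):
        super().__init__()
        self._lock = threading.Lock()
        self._total_reward = 0.0

    def set_reward(self, interaction_id: str, reward: float) -> None:
        with self._lock:
            interaction = self[interaction_id]
            # Remove the old contribution first so an overwrite is not double counted.
            if interaction.reward is not None:
                self._total_reward -= interaction.reward
            interaction.reward = reward
            self._total_reward += reward


cache = ToyInteractionCache()
cache["chatcmpl-1"] = ToyInteraction()
cache["chatcmpl-2"] = ToyInteraction()

cache.set_reward("chatcmpl-1", 1.0)
cache.set_reward("chatcmpl-1", 0.5)  # overwrite: the total becomes 0.5, not 1.5
cache.set_reward("chatcmpl-2", 2.0)
print(cache._total_reward)  # 2.5
```

Subtracting the previous value before adding the new one keeps the total correct even when the same interaction is rewarded more than once.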

Setting Reward on Last Interaction

For workflows where you process interactions sequentially, you can set the reward on the most recently created interaction using set_last_reward() (areal/experimental/openai/cache.py:51-53). Internally, this identifies the last key in the OrderedDict using the last_interaction_id property (areal/experimental/openai/cache.py:37-38).
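The "last key in the OrderedDict" lookup can be sketched as follows. This is an illustration under assumptions: the cache is reduced to a plain mapping from interaction ID to reward, and the two free functions are hypothetical helpers that mirror the names last_interaction_id and set_last_reward() rather than reproduce AReaL's code.

```python
from collections import OrderedDict

# Hypothetical cache contents: interaction ID -> reward (None means "not set yet").
rewards = OrderedDict()
rewards["chatcmpl-1"] = None
rewards["chatcmpl-2"] = None


def last_interaction_id(cache):
    """Return the most recently inserted key, or None for an empty cache."""
    return next(reversed(cache), None)


def set_last_reward(cache, reward):
    """Assign a reward to the most recently created interaction."""
    last_id = last_interaction_id(cache)
    if last_id is None:
        raise KeyError("cannot set a reward on an empty cache")
    cache[last_id] = reward


set_last_reward(rewards, 1.0)
print(list(rewards.items()))  # [('chatcmpl-1', None), ('chatcmpl-2', 1.0)]
```

Because an OrderedDict preserves insertion order, the last key is also the most recently created interaction, so no separate timestamp is needed.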


Backward Reward Discounting

In multi-turn conversations, a terminal reward is often propagated backward through the conversation history using the apply_reward_discount() method (areal/experimental/openai/cache.py:55-105).

Algorithm

The discounting algorithm processes interactions in reverse creation order (most recent first):

  1. Start with the most recent interaction.
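  2. If it has no explicit reward, set it to 0.0 (with a warning) (areal/experimental/openai/cache.py:94-101).
  3. For each earlier interaction i, propagate the reward backward from the interaction that follows it, applying the geometric discount.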
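The exact update rule for step 3 is not spelled out above. The sketch below assumes a common form of geometric discounting: each earlier interaction keeps its own reward (or 0.0 if none was set) and adds turn_discount times the already-discounted reward of the interaction that follows it, that is, r_i = r_i + turn_discount * r_(i+1). The function discount_rewards is hypothetical and operates on a plain list in creation order rather than on the real cache.

```python
import warnings
from typing import List, Optional


def discount_rewards(rewards: List[Optional[float]], turn_discount: float) -> List[float]:
    """Propagate rewards backward over a list given in creation order.

    Assumed rule (not confirmed by the source): r[i] = r[i] + turn_discount * r[i + 1].
    """
    if not rewards:
        return []
    out = list(rewards)
    # Steps 1-2: start from the most recent interaction; default a missing reward to 0.0.
    if out[-1] is None:
        warnings.warn("most recent interaction has no reward; using 0.0")
        out[-1] = 0.0
    # Step 3: walk backward through the earlier interactions.
    for i in range(len(out) - 2, -1, -1):
        own = out[i] if out[i] is not None else 0.0
        out[i] = own + turn_discount * out[i + 1]
    return out


print(discount_rewards([None, None, 1.0], turn_discount=0.5))  # [0.25, 0.5, 1.0]
```

With a terminal reward of 1.0 and a discount of 0.5, the reward halves at each step backward, which is the geometric decay the section describes.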

Implementation Flow


Sources: areal/experimental/openai/cache.py:55-105


Asynchronous Reward Execution

To prevent reward computation from blocking the main event loop in agentic workflows, AReaL provides the AsyncRewardWrapper (areal/api/reward_api.py:63-68).
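The general technique of wrapping a synchronous reward function so it can be awaited is sketched below. ToyAsyncReward and exact_match_reward are hypothetical names, and the sketch deliberately swaps in a thread pool for simplicity, whereas this page describes AsyncRewardWrapper as running reward logic in isolated processes. It does not reproduce the real wrapper's interface.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor
from functools import partial


class ToyAsyncReward:
    """Hypothetical wrapper (not AReaL's AsyncRewardWrapper): runs a sync function off the event loop."""

    def __init__(self, reward_fn, max_workers=2):
        self._reward_fn = reward_fn
        self._executor = ThreadPoolExecutor(max_workers=max_workers)

    async def __call__(self, *args, **kwargs):
        loop = asyncio.get_running_loop()
        # run_in_executor takes positional args only, so bind kwargs with partial.
        call = partial(self._reward_fn, *args, **kwargs)
        return await loop.run_in_executor(self._executor, call)

    def shutdown(self):
        self._executor.shutdown(wait=True)


def exact_match_reward(completion: str, answer: str) -> float:
    time.sleep(0.05)  # stand-in for slow, blocking reward logic
    return 1.0 if completion.strip() == answer.strip() else 0.0


async def main():
    reward = ToyAsyncReward(exact_match_reward)
    try:
        # Both calls run concurrently without blocking the event loop.
        return await asyncio.gather(reward("42", "42"), reward("41", "42"))
    finally:
        reward.shutdown()


results = asyncio.run(main())
print(results)  # [1.0, 0.0]
```

A process pool gives stronger isolation for CPU-bound or crash-prone reward code, at the cost of requiring the reward function and its arguments to be picklable.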

Features

Multi-Turn Workflow Integration

The MultiTurnWorkflow uses this wrapper to evaluate reasoning steps. If a reward is 0, it applies a turn_discount to the terminal reward based on the number of turns taken (areal/experimental/workflow/multi_turn_v2.py:86-88).
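One way to read "applies a turn_discount based on the number of turns taken" is that every zero-reward turn multiplies a running discount factor, and the first non-zero reward is scaled by that factor. The sketch below encodes that reading. The function discounted_terminal_reward and its parameters are hypothetical, and the actual workflow may count turns differently.

```python
from typing import Sequence


def discounted_terminal_reward(turn_rewards: Sequence[float], turn_discount: float) -> float:
    """Scale the first non-zero reward by turn_discount for every zero-reward turn before it.

    Assumption: a reward of 0 means "try another turn"; this is not AReaL's code.
    """
    discount = 1.0
    for reward in turn_rewards:
        if reward != 0:
            return reward * discount
        discount *= turn_discount  # each failed turn shrinks the eventual reward
    return 0.0  # no turn ever earned a reward


print(discounted_terminal_reward([0.0, 0.0, 1.0], turn_discount=0.5))  # 0.25
print(discounted_terminal_reward([1.0], turn_discount=0.5))            # 1.0
```

Under this reading, succeeding on the first turn earns the full reward, while each extra turn needed reduces it geometrically.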

Sources: areal/api/reward_api.py:63-184, areal/experimental/workflow/multi_turn_v2.py:17-96


Reward Handling in Proxy Server

When AReaL runs as an OpenAI-compatible proxy, rewards are managed within the SessionData object (areal/experimental/openai/proxy/server.py:66-78).

Trajectory Export and Discounting

When the proxy server receives an export_trajectories request, it triggers the discounting logic on the session's cache before returning the data to the trainer (areal/experimental/openai/proxy/server.py:115-121).

Serialization

Because rewards and interactions must be sent over HTTP, the proxy server serializes them with serialize_interactions (areal/experimental/openai/proxy/server.py:129-150) before transport.
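The wire format itself is not described on this page. As a rough illustration of what preparing rewarded interactions for HTTP transport involves, the sketch below round-trips a small dataclass through JSON. ToyRewardedInteraction, its fields, and both helper functions are hypothetical; the real serialization handles a richer record and is likely more involved.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToyRewardedInteraction:
    """Hypothetical record; field names are assumptions made for this example."""
    interaction_id: str
    output_token_ids: List[int] = field(default_factory=list)
    reward: Optional[float] = None


def serialize(interactions: Dict[str, ToyRewardedInteraction]) -> str:
    """Turn a mapping of interactions into a JSON string suitable for an HTTP body."""
    return json.dumps({key: asdict(item) for key, item in interactions.items()})


def deserialize(payload: str) -> Dict[str, ToyRewardedInteraction]:
    """Rebuild the interactions on the receiving side."""
    return {key: ToyRewardedInteraction(**item) for key, item in json.loads(payload).items()}


original = {
    "chatcmpl-1": ToyRewardedInteraction("chatcmpl-1", [11, 12], reward=0.5),
    "chatcmpl-2": ToyRewardedInteraction("chatcmpl-2", [21], reward=1.0),
}
restored = deserialize(serialize(original))
print(restored == original)  # True
```

The point of the round trip is that the reward travels with the interaction it belongs to, so the trainer receives both together.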


Thread Safety and State Management

The InteractionCache uses a threading.Lock to ensure safe concurrent reward updates (areal/experimental/openai/cache.py:18-46).

Key Invariants

  1. Single Discount Application: apply_reward_discount() can only be called once per cache instance, enforced by the _apply_reward_discount_called flag (areal/experimental/openai/cache.py:86-88).
  2. Total Reward Consistency: _total_reward is updated atomically when rewards change by subtracting the old reward and adding the new one (areal/experimental/openai/cache.py:47-49).
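The first invariant matters because discounting mutates rewards in place, so a second pass would compound the propagation. A minimal sketch of such a guard follows, assuming a repeated call raises an error. OnceOnlyDiscount is a hypothetical class, and both the error type and the update rule are assumptions rather than AReaL's actual behavior.

```python
class OnceOnlyDiscount:
    """Hypothetical holder of rewards in creation order, with a call-once guard."""

    def __init__(self, rewards):
        self.rewards = list(rewards)
        self._apply_reward_discount_called = False

    def apply_reward_discount(self, turn_discount=1.0):
        if self._apply_reward_discount_called:
            # Assumed behavior: refuse to discount already-discounted rewards.
            raise RuntimeError("apply_reward_discount() was already called on this cache")
        self._apply_reward_discount_called = True
        for i in range(len(self.rewards) - 2, -1, -1):
            self.rewards[i] += turn_discount * self.rewards[i + 1]
        return self.rewards


holder = OnceOnlyDiscount([0.0, 0.0, 1.0])
first = list(holder.apply_reward_discount(turn_discount=0.5))
print(first)  # [0.25, 0.5, 1.0]

try:
    holder.apply_reward_discount(turn_discount=0.5)
    second_call_blocked = False
except RuntimeError:
    second_call_blocked = True
print(second_call_blocked)  # True
```

Without the flag, the second call would turn [0.25, 0.5, 1.0] into [0.75, 1.0, 1.0], silently inflating the earlier rewards.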

Code Entity Reference

| Entity | Location | Purpose |
| --- | --- | --- |
| ArealOpenAI.set_reward() | areal/experimental/openai/client.py:1088-1092 | Public API for setting reward by ID. |
| ArealOpenAI.apply_reward_discount() | areal/experimental/openai/client.py:1100-1128 | Public API for backward reward propagation. |
| InteractionCache.set_reward() | areal/experimental/openai/cache.py:44-49 | Thread-safe reward update with total tracking. |
| InteractionCache.apply_reward_discount() | areal/experimental/openai/cache.py:55-105 | Core discounting algorithm implementation. |
| AsyncRewardWrapper | areal/api/reward_api.py:63-68 | Utility for running reward logic in isolated processes. |
| InteractionWithTokenLogpReward | areal/experimental/openai/types.py:36-58 | Dataclass holding the reward and interaction metadata. |
| SessionData.export_interactions() | areal/experimental/openai/proxy/server.py:115-121 | Triggers discounting before data export in proxy mode. |
| serialize_interactions | areal/experimental/openai/proxy/server.py:129-150 | Prepares rewarded interactions for HTTP transport. |

Sources: areal/experimental/openai/client.py:1035-1163, areal/experimental/openai/cache.py:13-160, areal/experimental/openai/types.py:36-195, areal/experimental/openai/proxy/server.py:115-150, areal/api/reward_api.py:63-184