URL: https://deepwiki.com/inclusionAI/AReaL/10.3-normalization-and-estimation

Last indexed: 7 May 2026 (2e12c1)

Normalization and Estimation

This page documents the normalization and statistical estimation functionality in AReaL, which is used primarily for reward and advantage normalization during reinforcement learning training. Normalization stabilizes training by reducing variance in reward signals and advantage estimates.

For information about data preprocessing and micro-batching, see 10.1 MicroBatch System. For loss computation and PPO details, see 7.2 PPO Implementation.


Overview

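AReaL provides flexible normalization capabilities for rewards and advantages through the Normalization class and the NormConfig dataclass. The system supports batch-level and group-level statistics, leave-one-out group means, and an optional unbiased standard deviation, all selected through NormConfig.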


Core Components

The normalization logic is defined by the NormConfig dataclass in areal/api/cli_args.py:40-95 and used by the PPOActor in areal/trainer/ppo/actor.py:55-58.

Data Flow and Code Entity Space

[Diagram: how the normalization entities in the code interact during the advantage computation phase in the PPOActor.]


Sources: areal/trainer/ppo/actor.py:55-58, areal/api/cli_args.py:40-75, areal/utils/data.py:29, areal/utils/functional/functional.py:16

NormConfig Structure

The NormConfig defines the behavior of the normalization transformation.

Parameter        Type         Default   Description
mean_level       str | None   "batch"   Level for mean: "batch", "group", or None.
mean_leave1out   bool         False     If True, uses leave-one-out mean for groups.
std_level        str | None   "batch"   Level for std: "batch", "group", or None.
std_unbiased     bool         True      Uses Bessel's correction (N-1) for standard deviation.
eps              float        1e-5      Numerical stability constant.
group_size       int          1         Size of the group for group-level normalization.

Sources: areal/api/cli_args.py:43-75, tests/test_adv_norm_config.py:37-41
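To make the interaction between the fields in the table above concrete, the following is a minimal pure-Python sketch. NormConfigSketch and normalize are hypothetical names that only mirror the documented fields; they are not AReaL's classes. The sketch assumes that groups are contiguous blocks of group_size values, that a level of None skips the corresponding step, and that eps is added to the standard deviation before dividing. The actual implementation may differ on each of these points.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NormConfigSketch:
    """Hypothetical mirror of the documented fields (not AReaL's class)."""
    mean_level: Optional[str] = "batch"
    mean_leave1out: bool = False
    std_level: Optional[str] = "batch"
    std_unbiased: bool = True
    eps: float = 1e-5
    group_size: int = 1


def _mean(xs: List[float]) -> float:
    return sum(xs) / len(xs)


def _std(xs: List[float], unbiased: bool) -> float:
    # Bessel's correction divides by N-1 instead of N.
    denom = len(xs) - 1 if unbiased else len(xs)
    if denom <= 0:
        return 0.0
    m = _mean(xs)
    return (sum((x - m) ** 2 for x in xs) / denom) ** 0.5


def normalize(values: List[float], cfg: NormConfigSketch) -> List[float]:
    g = cfg.group_size
    out = []
    for i, x in enumerate(values):
        # Assumption: groups are contiguous blocks of `group_size` entries.
        start = (i // g) * g
        group = values[start:start + g]
        idx = i - start

        if cfg.mean_level == "batch":
            mean = _mean(values)
        elif cfg.mean_level == "group":
            if cfg.mean_leave1out and len(group) > 1:
                # Leave-one-out: average the other members of the group.
                mean = _mean(group[:idx] + group[idx + 1:])
            else:
                mean = _mean(group)
        else:
            mean = 0.0  # mean_level=None: no centering

        if cfg.std_level == "batch":
            std = _std(values, cfg.std_unbiased)
        elif cfg.std_level == "group":
            std = _std(group, cfg.std_unbiased)
        else:
            std = None  # std_level=None: no scaling

        centered = x - mean
        out.append(centered if std is None else centered / (std + cfg.eps))
    return out
```

For example, with mean_level="group", std_level=None, and group_size=2, the values [1.0, 3.0, 10.0, 20.0] are centered within each pair, so a large reward in one group does not shift the baseline of another.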


Normalization Implementation

Masked Normalization Functional

The core mathematical logic resides in masked_normalization. It supports high-precision calculation (using float64) and distributed all_reduce to ensure consistency across sharded data.
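The idea can be illustrated with a small pure-Python sketch. masked_normalization_sketch is a hypothetical stand-in, not the AReaL function: it works on plain lists, assumes eps is added to the variance inside the square root, and assumes masked-out positions are returned as zero. The comment marks where a distributed run would combine partial sums across ranks.

```python
import math
from typing import List, Sequence


def masked_normalization_sketch(
    x: Sequence[float],
    mask: Sequence[int],
    unbiased: bool = False,
    eps: float = 1e-5,
) -> List[float]:
    # Count and accumulate over unmasked entries only.
    # Python floats are double precision, matching the float64 accumulation idea.
    n = sum(mask)
    if n == 0:
        return [0.0 for _ in x]
    total = math.fsum(v for v, m in zip(x, mask) if m)
    total_sq = math.fsum(v * v for v, m in zip(x, mask) if m)

    # In a distributed run, n, total, and total_sq would each be summed
    # across ranks (all_reduce) here so that every rank uses the same statistics.

    mean = total / n
    var = max(total_sq / n - mean * mean, 0.0)
    if unbiased and n > 1:
        var *= n / (n - 1)  # Bessel's correction

    inv_std = 1.0 / math.sqrt(var + eps)  # assumption: eps inside the sqrt
    return [(v - mean) * inv_std if m else 0.0 for v, m in zip(x, mask)]
```

Reducing the count, the sum, and the sum of squares (rather than per-rank means) is what keeps the result independent of how the data is sharded.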

Sequence-Level Advantages (GSPO)

For algorithms requiring sequence-level statistics (like GSPO), the _compute_sequence_level_ratio_and_advantages function handles both packed (1D) and padded (2D) tensors.
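As an illustration of the two layouts, the sketch below computes a GSPO-style sequence-level importance ratio, taken here as the exponential of the mean per-token log-probability difference within each sequence. The function names, the cu_seqlens boundary list, and the plain-list inputs are assumptions for this example; the sketch does not reproduce how _compute_sequence_level_ratio_and_advantages handles advantages or loss masks in the packed case.

```python
import math
from typing import List, Sequence


def seq_ratio_packed(
    logp: Sequence[float],
    old_logp: Sequence[float],
    cu_seqlens: Sequence[int],
) -> List[float]:
    """Packed (1D) layout: sequence i spans cu_seqlens[i]:cu_seqlens[i + 1]."""
    ratios = []
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        diffs = [logp[t] - old_logp[t] for t in range(start, end)]
        # The length-normalized log-ratio gives a geometric-mean token ratio.
        ratios.append(math.exp(sum(diffs) / len(diffs)))
    return ratios


def seq_ratio_padded(
    logp: Sequence[Sequence[float]],
    old_logp: Sequence[Sequence[float]],
    mask: Sequence[Sequence[int]],
) -> List[float]:
    """Padded (2D) layout: one row per sequence, mask marks valid tokens."""
    ratios = []
    for lp_row, old_row, m_row in zip(logp, old_logp, mask):
        n = sum(m_row)
        total = sum((a - b) * m for a, b, m in zip(lp_row, old_row, m_row))
        ratios.append(math.exp(total / max(n, 1)))
    return ratios
```

Both functions return one ratio per sequence, so the same batch yields the same values whether it is stored packed or padded.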

Sources: areal/utils/functional/functional.py:16-53, areal/utils/functional/functional.py:56-147


Statistical Estimation and Denominators

When logging training statistics (e.g., loss, clip ratios), AReaL must determine the correct number of tokens to use as a denominator. This is complicated by sequence packing and Context Parallelism (CP), where tensors are sliced across devices.

Token Denominator Inference

The function infer_token_denominator ensures statistics remain consistent regardless of parallelism strategies by preferring metadata over sliced tensors.
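The "metadata first" preference can be sketched as a fallback chain. Everything below is hypothetical: the function name, the batch keys cu_seqlens and attention_mask, and the order of the fallbacks are assumptions chosen for illustration, and the real infer_token_denominator may use different inputs.

```python
from typing import Any, Mapping


def infer_token_denominator_sketch(
    batch: Mapping[str, Any],
    local_tensor_len: int,
) -> int:
    # 1. Packed-sequence metadata: the last cumulative length is the total
    #    token count, independent of how tensors were sliced across devices.
    cu_seqlens = batch.get("cu_seqlens")
    if cu_seqlens is not None:
        return int(cu_seqlens[-1])

    # 2. Padded batches: count the valid positions in the attention mask.
    attention_mask = batch.get("attention_mask")
    if attention_mask is not None:
        return int(sum(sum(row) for row in attention_mask))

    # 3. Last resort: the length of the local (possibly sliced) tensor.
    return int(local_tensor_len)
```

With context parallelism, a rank might hold only 6 of 12 packed tokens; the metadata path still returns 12, so per-token averages agree across ranks.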


Sources: areal/trainer/ppo/stats.py:10-38


Vocab-Parallel Log Probability Estimation

For models using Tensor Parallelism (TP) where the vocabulary is sharded, AReaL implements a memory-efficient _VocabParallelLogProbs autograd function.
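The underlying arithmetic can be shown without any distributed runtime. In the sketch below, each inner list stands for the logits held by one TP rank for a single token, and the comments mark where a real implementation would use collective reductions. The function name and the list-based simulation are assumptions for illustration; the sketch does not reproduce the autograd or memory-saving behavior of _VocabParallelLogProbs.

```python
import math
from typing import Sequence


def vocab_parallel_logprob_sketch(
    logit_shards: Sequence[Sequence[float]],
    target: int,
) -> float:
    # Step 1: global max over all shards (an all_reduce with MAX in practice),
    # subtracted for numerical stability.
    global_max = max(max(shard) for shard in logit_shards)

    # Step 2: each shard sums exp(logit - max) locally, then the partial sums
    # are added together (an all_reduce with SUM in practice).
    sum_exp = sum(
        sum(math.exp(v - global_max) for v in shard) for shard in logit_shards
    )

    # Step 3: only the shard that owns the target id contributes its logit;
    # every other shard contributes zero before the SUM reduction.
    target_logit = 0.0
    offset = 0
    for shard in logit_shards:
        if offset <= target < offset + len(shard):
            target_logit += shard[target - offset]
        offset += len(shard)

    # log softmax(target) = logit_target - max - log(sum of shifted exps)
    return target_logit - global_max - math.log(sum_exp)
```

Only scalars per token cross rank boundaries in this scheme, which is why the full vocabulary-sized logits never need to be gathered on one device.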

Memory Optimization

Chunked Application

To further reduce peak memory during log-probability and entropy computation, the system uses _chunked_apply, which processes tensors in small segments along the sequence dimension (areal/utils/functional/vocab_parallel.py:41-61).
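A minimal sketch of the chunking pattern, using plain lists in place of tensors. chunked_apply_sketch and its signature are hypothetical and only illustrate the idea of bounding the size of intermediate results; they are not the signature of _chunked_apply.

```python
from typing import Callable, List, Sequence


def chunked_apply_sketch(
    fn: Callable[[Sequence[float]], Sequence[float]],
    xs: Sequence[float],
    chunk_size: int,
) -> List[float]:
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    out: List[float] = []
    # Walk the sequence dimension in fixed-size windows so that fn only ever
    # materializes intermediates for chunk_size positions at a time.
    for start in range(0, len(xs), chunk_size):
        out.extend(fn(xs[start:start + chunk_size]))
    return out
```

The result matches applying fn to the whole sequence at once, provided fn treats each position independently, as per-token log-probability and entropy computations do.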

Sources: areal/utils/functional/vocab_parallel.py:84-117, areal/utils/functional/vocab_parallel.py:119-168


Reward and Advantage Processing

In PPOActor._compute_advantages, several estimation and normalization steps occur in sequence:

  1. Length Penalty: reward_overlong_penalty is applied if responses exceed overlong_tokens (areal/trainer/ppo/actor.py:153-165).
  2. Reward Scaling/Clipping: Rewards are shifted by reward_bias, multiplied by reward_scaling, and clipped to reward_clip (areal/trainer/ppo/actor.py:167-171).
  3. Reward Normalization: If reward_norm is configured, the Normalization object is called on the reward scores (areal/trainer/ppo/actor.py:172-173).
  4. KL Regularization: KL divergence between the current policy and reference policy is estimated using KLEstimator (areal/trainer/ppo/actor.py:53, areal/trainer/ppo/actor.py:204-211).
  5. Advantage Normalization: Final advantages are passed through self.adv_norm before being used for the policy gradient (areal/trainer/ppo/actor.py:55, areal/trainer/ppo/actor.py:255).
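Step 2 of the list above can be sketched as a scalar function. The parameter names follow the text, but the details are assumptions: the sketch adds the bias before scaling and clips symmetrically to [-reward_clip, reward_clip]. The sign convention of the shift and the shape of the clipping range should be checked against the cited lines of actor.py.

```python
def shape_reward_sketch(
    reward: float,
    reward_bias: float,
    reward_scaling: float,
    reward_clip: float,
) -> float:
    # Assumption: the bias is added first, then the result is scaled.
    shaped = (reward + reward_bias) * reward_scaling
    # Assumption: clipping is symmetric around zero.
    return max(-reward_clip, min(reward_clip, shaped))
```

For instance, with reward_bias=-0.5, reward_scaling=2.0, and reward_clip=5.0, a raw reward of 1.0 becomes 1.0, while a raw reward of 10.0 saturates at 5.0.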

Proximal Policy Estimation

For decoupled (off-policy) PPO, AReaL estimates the "proximal" policy using different methods. The proximal policy is the recent policy that anchors the clipped update, as distinct from the behavior policy that collected the data. This is critical for stability when the training policy drifts from the behavior policy.
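For orientation, the sketch below shows the general form of a decoupled PPO per-token loss as it is usually written: the clipping ratio is taken between the current policy and a proximal policy, and a separate importance weight corrects for the gap between the proximal policy and the behavior policy that generated the samples. The function name, the scalar inputs, and the eps_clip default are assumptions for illustration; the sketch does not show how AReaL approximates the proximal log-probabilities.

```python
import math


def decoupled_ppo_token_loss_sketch(
    logp: float,        # log-prob under the policy being trained
    prox_logp: float,   # log-prob under the proximal policy
    behav_logp: float,  # log-prob under the behavior (sampling) policy
    advantage: float,
    eps_clip: float = 0.2,
) -> float:
    # The trust-region ratio is measured against the proximal policy.
    ratio = math.exp(logp - prox_logp)
    clipped = min(max(ratio, 1.0 - eps_clip), 1.0 + eps_clip)
    surrogate = min(ratio * advantage, clipped * advantage)

    # The importance weight corrects for sampling from the behavior policy.
    behav_weight = math.exp(prox_logp - behav_logp)

    # Negated because optimizers minimize.
    return -behav_weight * surrogate
```

When the three log-probabilities coincide, both the ratio and the weight equal 1 and the loss reduces to the negative advantage, which is the on-policy case.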

Proximal Approximation Methods

Sources: areal/trainer/ppo/actor.py:71-124, areal/utils/constants.py:11-18, tests/test_prox_approx.py:24-127