Last indexed: 7 May 2026 (2e12c1)

Normalization and Estimation

This page documents the normalization and statistical estimation functionality in AReaL, which is used primarily for reward and advantage normalization during reinforcement learning training. Normalization stabilizes training by reducing variance in reward signals and advantage estimates.

For information about data preprocessing and micro-batching, see 10.1 MicroBatch System. For loss computation and PPO details, see 7.2 PPO Implementation.

Overview

AReaL provides flexible normalization capabilities for rewards and advantages through the Normalization class and NormConfig dataclass. The system supports:

Batch-level normalization: Normalize across all sequences in a training batch. areal/api/cli_args.py44-49
Group-level normalization: Normalize within groups of sequences (e.g., multiple samples per prompt), essential for GRPO. areal/api/cli_args.py73-75
Leave-one-out averaging: Exclude the current sample when computing group means to reduce bias. areal/api/cli_args.py50-53
Unbiased estimation: Use Bessel's correction for standard deviation. areal/api/cli_args.py61-66
Masked normalization: Low-level functional implementation that handles sharded tensors and distributed reduction. areal/utils/functional/functional.py16-25
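As a rough illustration of the group-level, leave-one-out, and unbiased options listed above, the following sketch normalizes a flat list of rewards in plain Python. The function name `group_normalize` and its arguments are hypothetical and are not AReaL's API; the real implementation works on tensors and may differ in detail.

```python
import math


def group_normalize(values, group_size, leave1out=False, unbiased=True, eps=1e-5):
    """Normalize `values` within consecutive groups of `group_size` items.

    Hypothetical helper for illustration only; not AReaL's API.
    """
    if group_size < 1 or len(values) % group_size != 0:
        raise ValueError("len(values) must be a positive multiple of group_size")
    out = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        n = len(group)
        total = sum(group)
        mean = total / n
        # Bessel's correction divides the squared deviations by n - 1 instead of n.
        denom = n - 1 if unbiased and n > 1 else n
        std = math.sqrt(sum((v - mean) ** 2 for v in group) / denom)
        for v in group:
            # Leave-one-out: center each sample on the mean of the other samples.
            center = (total - v) / (n - 1) if leave1out and n > 1 else mean
            out.append((v - center) / (std + eps))
    return out
```

With `group_size` equal to the number of samples per prompt, each sample is compared only against its siblings, which is the GRPO-style setting mentioned above.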

Core Components

The normalization logic is defined by the NormConfig dataclass in areal/api/cli_args.py40-95 and used by the PPOActor in areal/trainer/ppo/actor.py55-58.

Data Flow and Code Entity Space

The following diagram shows how normalization entities in the code interact during the advantage computation phase in the PPOActor.

Sources: areal/trainer/ppo/actor.py55-58 areal/api/cli_args.py40-75 areal/utils/data.py29 areal/utils/functional/functional.py16

NormConfig Structure

The NormConfig defines the behavior of the normalization transformation.

Parameter	Type	Default	Description
`mean_level`	`str | None`	`"batch"`	Level for mean: `"batch"`, `"group"`, or `None`.
`mean_leave1out`	`bool`	`False`	If True, uses leave-one-out mean for groups.
`std_level`	`str | None`	`"batch"`	Level for std: `"batch"`, `"group"`, or `None`.
`std_unbiased`	`bool`	`True`	Uses Bessel's correction (N-1) for standard deviation.
`eps`	`float`	`1e-5`	Numerical stability constant.
`group_size`	`int`	`1`	Size of the group for group-level normalization.

Sources: areal/api/cli_args.py43-75 tests/test_adv_norm_config.py37-41
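The table can be read as a dataclass along the following lines. This is a hypothetical mirror (`NormConfigSketch`) built only from the fields and defaults listed above; the validation in `__post_init__` is an assumption, not documented AReaL behavior. See areal/api/cli_args.py for the actual definition.

```python
from dataclasses import dataclass
from typing import Optional

_LEVELS = ("batch", "group", None)


@dataclass
class NormConfigSketch:
    """Hypothetical mirror of the NormConfig fields listed in the table."""

    mean_level: Optional[str] = "batch"
    mean_leave1out: bool = False
    std_level: Optional[str] = "batch"
    std_unbiased: bool = True
    eps: float = 1e-5
    group_size: int = 1

    def __post_init__(self):
        # Assumed sanity checks; the real dataclass may validate differently.
        if self.mean_level not in _LEVELS:
            raise ValueError(f"unsupported mean_level: {self.mean_level!r}")
        if self.std_level not in _LEVELS:
            raise ValueError(f"unsupported std_level: {self.std_level!r}")
        if self.group_size < 1:
            raise ValueError("group_size must be at least 1")


# A GRPO-style setup: statistics computed within groups of 8 samples per prompt.
grpo_like = NormConfigSketch(mean_level="group", std_level="group", group_size=8)
```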

Normalization Implementation

Masked Normalization Functional

The core mathematical logic resides in masked_normalization. It supports high-precision calculation (using float64) and distributed all_reduce to ensure consistency across sharded data.

Precision: Optionally uses torch.float64 for summing squares to prevent overflow/underflow. areal/utils/functional/functional.py26-27
Distributed: Performs dist.all_reduce on the factor (mask sum), x_sum, and x_sum_sq when all_reduce=True. areal/utils/functional/functional.py40-47
Formula: (x - mean) / (var.sqrt() + eps), where mean = sum(x)/N, var = sum(x^2)/N - mean^2, and N is the mask sum (factor). areal/utils/functional/functional.py48-53
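The formula and the three reduced quantities can be sketched in plain Python as follows. The name `masked_normalize_sharded` is hypothetical: each `(values, mask)` pair stands for one rank's shard, and summing the per-shard partials stands in for `dist.all_reduce`. Zeroing the masked positions in the output and clamping the variance at zero are assumptions of this sketch. Python floats are already 64-bit, which loosely corresponds to the float64 option.

```python
import math


def masked_normalize_sharded(shards, eps=1e-5):
    """Normalize masked values using statistics pooled over all shards.

    shards: list of (values, mask) pairs, one per simulated rank.
    Hypothetical helper for illustration only; not AReaL's API.
    """
    # Each rank computes three partial sums; adding them mimics all_reduce(SUM).
    factor = sum(sum(mask) for _, mask in shards)
    x_sum = sum(sum(v * m for v, m in zip(x, mask)) for x, mask in shards)
    x_sum_sq = sum(sum(v * v * m for v, m in zip(x, mask)) for x, mask in shards)
    if factor == 0:
        raise ValueError("mask selects no elements")
    mean = x_sum / factor
    # var = E[x^2] - E[x]^2; clamp tiny negative values caused by rounding.
    var = max(x_sum_sq / factor - mean ** 2, 0.0)
    std = math.sqrt(var)
    # Assumption: masked-out positions are returned as zero.
    return [
        [(v - mean) / (std + eps) * m for v, m in zip(x, mask)]
        for x, mask in shards
    ]
```

Because every shard ends up with the same `factor`, `x_sum`, and `x_sum_sq`, the normalized values do not depend on how the data was split.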

Sequence-Level Advantages (GSPO)

For algorithms requiring sequence-level statistics (like GSPO), the _compute_sequence_level_ratio_and_advantages function handles both packed (1D) and padded (2D) tensors.

Packed Format: Uses cu_seqlens and scatter_add_ to compute per-sequence means efficiently. areal/utils/functional/functional.py89-113
Broadcasting: Broadcasts sequence-averaged advantages back to token level so that gradient magnitude does not depend on sequence length. areal/utils/functional/functional.py123-127

Sources: areal/utils/functional/functional.py16-53 areal/utils/functional/functional.py56-147
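The packed-format path can be pictured with a small plain-Python sketch. The function `sequence_mean_broadcast` is hypothetical and ignores loss masks; it only shows how `cu_seqlens` boundaries map tokens to sequences, how a scatter-add style accumulation yields per-sequence means, and how those means are broadcast back to every token.

```python
def sequence_mean_broadcast(token_values, cu_seqlens):
    """Replace each token's value with the mean over its sequence (packed layout).

    cu_seqlens holds cumulative sequence boundaries, e.g. [0, 2, 5] for
    two sequences of lengths 2 and 3. Hypothetical helper, not AReaL's API.
    """
    num_seqs = len(cu_seqlens) - 1
    # Map every token position to the index of the sequence that owns it.
    seq_ids = []
    for i in range(num_seqs):
        seq_ids.extend([i] * (cu_seqlens[i + 1] - cu_seqlens[i]))
    if len(seq_ids) != len(token_values):
        raise ValueError("cu_seqlens does not match the number of tokens")
    sums = [0.0] * num_seqs
    counts = [0] * num_seqs
    # Accumulate per-sequence sums, mirroring a scatter-add keyed by sequence id.
    for sid, value in zip(seq_ids, token_values):
        sums[sid] += value
        counts[sid] += 1
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    # Broadcast each sequence mean back to all of its tokens.
    return [means[sid] for sid in seq_ids]
```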

Statistical Estimation and Denominators

When logging training statistics (e.g., loss, clip ratios), AReaL must determine the correct number of tokens to use as a denominator. This is complicated by sequence packing and Context Parallelism (CP), where tensors are sliced across devices.

Token Denominator Inference

The function infer_token_denominator ensures statistics remain consistent regardless of parallelism strategies by preferring metadata over sliced tensors.

Sources: areal/trainer/ppo/stats.py10-38
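The idea of preferring metadata over sliced tensors can be sketched as below. This page does not describe the actual lookup order or field names used in areal/trainer/ppo/stats.py, so the function name, the `cu_seqlens` and `attention_mask` keys, and the fallback order are all assumptions made for illustration.

```python
def infer_token_denominator_sketch(batch, local_num_tokens):
    """Pick a token count that does not depend on how tensors were sliced.

    Hypothetical helper; the keys and the fallback order are assumptions.
    """
    # Packed batches: the last cumulative boundary is the total token count.
    cu_seqlens = batch.get("cu_seqlens")
    if cu_seqlens is not None:
        return int(cu_seqlens[-1])
    # Padded batches: count the positions marked as real tokens.
    mask = batch.get("attention_mask")
    if mask is not None:
        return int(sum(sum(row) for row in mask))
    # Last resort: the local (possibly context-parallel sliced) length.
    return int(local_num_tokens)
```

Under context parallelism the local tensor may hold only a slice of each sequence, so a denominator taken from its shape alone would vary with the parallel layout.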

Vocab-Parallel Log Probability Estimation

For models using Tensor Parallelism (TP) where the vocabulary is sharded, AReaL implements a memory-efficient _VocabParallelLogProbs autograd function.

Memory Optimization

In-place Operations: Reuses the softmax tensor as the gradient input (grad_input) during the backward pass to avoid allocating a large [seq_len, vocab/tp] tensor. areal/utils/functional/vocab_parallel.py94-108
Numerical Stability: Subtracts the global max (via all_reduce) before computing exponentials. areal/utils/functional/vocab_parallel.py134-139
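The numerical-stability step can be illustrated without any tensor library. In this sketch, `vocab_parallel_logprob` is a hypothetical name, each inner list is one tensor-parallel rank's contiguous slice of the vocabulary, and the Python `max` and `sum` over shards stand in for the corresponding `all_reduce` calls. It covers the forward computation only and does not reproduce the in-place backward pass.

```python
import math


def vocab_parallel_logprob(logit_shards, target):
    """Log-probability of `target` when the vocabulary is split across shards.

    logit_shards: one list of logits per simulated TP rank.
    Hypothetical helper for illustration only; not AReaL's API.
    """
    vocab_size = sum(len(shard) for shard in logit_shards)
    if not 0 <= target < vocab_size:
        raise IndexError("target id outside the vocabulary")
    # Stand-in for all_reduce(MAX): the largest logit over the full vocabulary.
    global_max = max(max(shard) for shard in logit_shards)
    # Each rank sums exp(logit - max) locally; stand-in for all_reduce(SUM).
    sum_exp = sum(
        sum(math.exp(z - global_max) for z in shard) for shard in logit_shards
    )
    # Only the rank that owns `target` contributes its shifted logit.
    offset = 0
    target_logit = 0.0
    for shard in logit_shards:
        if offset <= target < offset + len(shard):
            target_logit = shard[target - offset] - global_max
        offset += len(shard)
    return target_logit - math.log(sum_exp)
```

Subtracting the global maximum keeps every exponent at or below zero, so logits near 1000 do not overflow.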

Chunked Application

To further reduce peak memory during log-probability and entropy computation, the system uses _chunked_apply, which processes tensors in small segments along the sequence dimension. areal/utils/functional/vocab_parallel.py41-61
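A minimal sketch of the chunking idea, assuming the wrapped function works row by row along the sequence dimension so that results from separate chunks can simply be concatenated. The name `chunked_apply_sketch` and its signature are hypothetical and do not reflect the real `_chunked_apply`.

```python
def chunked_apply_sketch(fn, rows, chunk_size):
    """Apply `fn` to `rows` in slices of at most `chunk_size` and join the results.

    `fn` takes a list of rows and returns one result per row.
    Hypothetical helper for illustration only; not AReaL's API.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    out = []
    for start in range(0, len(rows), chunk_size):
        # Only one chunk's intermediates are alive at a time, which bounds peak memory.
        out.extend(fn(rows[start:start + chunk_size]))
    return out
```

The trade-off is more, smaller calls in exchange for a lower memory peak; the result is the same as a single call whenever `fn` treats rows independently.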

Sources: areal/utils/functional/vocab_parallel.py84-117 areal/utils/functional/vocab_parallel.py119-168

Reward and Advantage Processing

In PPOActor._compute_advantages, several estimation and normalization steps occur in sequence:

Length Penalty: reward_overlong_penalty is applied if responses exceed overlong_tokens. areal/trainer/ppo/actor.py153-165
Reward Scaling/Clipping: Rewards are shifted by reward_bias, multiplied by reward_scaling, and clipped to reward_clip. areal/trainer/ppo/actor.py167-171
Reward Normalization: If reward_norm is configured, the Normalization object is called on the reward scores. areal/trainer/ppo/actor.py172-173
KL Regularization: KL divergence between the current policy and reference policy is estimated using KLEstimator. areal/trainer/ppo/actor.py53 areal/trainer/ppo/actor.py204-211
Advantage Normalization: Final advantages are passed through self.adv_norm before being used for the policy gradient. areal/trainer/ppo/actor.py55 areal/trainer/ppo/actor.py255
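The reward scaling and clipping step from the list above can be sketched for a single scalar reward. The order of operations follows the description (shift, scale, clip); treating the bias as additive, clipping symmetrically to `[-reward_clip, reward_clip]`, and the default values shown are assumptions of this sketch rather than confirmed details of PPOActor.

```python
def shape_reward(score, reward_bias=0.0, reward_scaling=1.0, reward_clip=20.0):
    """Shift, scale, then clip a scalar reward.

    Hypothetical helper; sign convention, symmetric clip range, and
    default values are assumptions, not AReaL's documented behavior.
    """
    shifted = score + reward_bias        # assumption: the bias is added
    scaled = shifted * reward_scaling
    # Clamp into [-reward_clip, reward_clip].
    return max(-reward_clip, min(reward_clip, scaled))
```

Clipping after scaling bounds the magnitude of any single reward before reward normalization and advantage estimation see it.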

Proximal Policy Estimation

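For decoupled (off-policy) PPO, AReaL estimates the log-probabilities of the "proximal" policy, the policy that anchors the clipped trust region and is generally distinct from the behavior policy that collected the data, using one of several methods. This is critical for stability when the training policy drifts from the behavior policy.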

Proximal Approximation Methods

Log-linear Interpolation: Approximates the proximal log-probability using linear interpolation in log-space between the behavior version and the current training version. tests/test_prox_approx.py24-43
Linear Interpolation: Interpolates arithmetically (a weighted average) in probability space rather than in log-space. tests/test_prox_approx.py108-127
Rollout Method: Returns the behavior log-probability unchanged (uses behavior policy as-is). tests/test_prox_approx.py46-63

Sources: areal/trainer/ppo/actor.py71-124 areal/utils/constants.py11-18 tests/test_prox_approx.py24-127
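The three methods can be compared on a single token with the sketch below. The function name, the method strings, and in particular the interpolation weight `alpha = (v_prox - v_behav) / (v_theta - v_behav)` derived from policy version numbers are assumptions made for illustration; the tests cited above define the actual behavior.

```python
import math


def approx_prox_logp(logp_behav, logp_theta, v_behav, v_prox, v_theta,
                     method="loglinear"):
    """Approximate the proximal log-probability of one token.

    Hypothetical helper; the weight `alpha` is an assumed definition.
    """
    if method == "rollout":
        # Use the behavior log-probability unchanged.
        return logp_behav
    if v_theta == v_behav:
        # No version gap, so there is nothing to interpolate.
        return logp_behav
    alpha = (v_prox - v_behav) / (v_theta - v_behav)
    if method == "loglinear":
        # Straight line between the two log-probabilities.
        return logp_behav + alpha * (logp_theta - logp_behav)
    if method == "linear":
        # Weighted average of probabilities, mapped back to log-space.
        p = (1 - alpha) * math.exp(logp_behav) + alpha * math.exp(logp_theta)
        return math.log(p)
    raise ValueError(f"unknown method: {method!r}")
```

With behavior probability 0.2, current probability 0.8, and a weight of 0.5, the log-linear variant gives the geometric mean (0.4) while the linear variant gives the arithmetic mean (0.5).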


URL: https://deepwiki.com/inclusionAI/AReaL/10.3-normalization-and-estimation