Last indexed: 7 May 2026 (2e12c1)

Algorithm-Specific Configurations

Purpose and Scope

This page documents the configuration parameters specific to reinforcement learning algorithms supported by AReaL, including PPO, GRPO, DAPO, and their variants. These configurations control algorithm behavior such as clipping strategies, normalization, minibatch processing, reward shaping, and on-policy distillation.

For general training engine configurations (optimizers, parallelism), see 2.4 Training Engine Configurations. For dataset and data processing configurations, see 2.7 MicroBatchSpec and Data Configurations. For the overall configuration system structure, see 2.1 Configuration Overview.

Configuration Classes Overview

The algorithm configuration is primarily managed through PPOActorConfig and PPOCriticConfig, which contain nested NormConfig objects for reward and advantage normalization. These classes are defined in areal/api/cli_args.py.

Algorithm Configuration Hierarchy

Title: Algorithm Configuration Hierarchy (Natural Language to Code Entity Space)

Sources: areal/api/cli_args.py42-96 areal/api/cli_args.py164-210 areal/trainer/ppo/actor.py43-68

NormConfig: Normalization Configuration

The NormConfig dataclass areal/api/cli_args.py42-78 controls how rewards and advantages are normalized. Normalization is handled by the Normalization class in areal/utils/data.py areal/trainer/ppo/actor.py29

Parameters

Parameter	Type	Default	Description
`mean_level`	`str | None`	`"batch"`	Mean level for normalization: `"batch"`, `"group"`, or `None`. areal/api/cli_args.py46-52
`mean_leave1out`	`bool`	`False`	Whether to use leave-one-out average for bias reduction. areal/api/cli_args.py53-56
`std_level`	`str | None`	`"batch"`	Std level for normalization: `"batch"`, `"group"`, or `None`. areal/api/cli_args.py57-63
`std_unbiased`	`bool`	`True`	Whether to use unbiased standard deviation computation. areal/api/cli_args.py64-69
`eps`	`float`	`1e-5`	Epsilon to avoid numerical issues during division. areal/api/cli_args.py70-75
`group_size`	`int`	`1`	Group size for group-level normalization (required if `mean_level` or `std_level` is `"group"`). areal/api/cli_args.py76-78

Normalization Implementation

The normalization logic is validated in NormConfig.__post_init__ areal/api/cli_args.py80-97 and applied during advantage computation in PPOActor._compute_advantages areal/trainer/ppo/actor.py145-173
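
The kind of validation described above can be sketched with a small stand-in dataclass. `NormConfigSketch` is a hypothetical name, and the specific checks (allowed level values, a positive `group_size` when a level is `"group"`) are assumptions based on the parameter table; the real `NormConfig.__post_init__` may check more or different conditions.

```python
from dataclasses import dataclass
from typing import Optional

# The three level choices listed in the parameter table.
_ALLOWED_LEVELS = ("batch", "group", None)


@dataclass
class NormConfigSketch:
    """Illustrative stand-in for NormConfig; not the AReaL class itself."""

    mean_level: Optional[str] = "batch"
    mean_leave1out: bool = False
    std_level: Optional[str] = "batch"
    std_unbiased: bool = True
    eps: float = 1e-5
    group_size: int = 1

    def __post_init__(self) -> None:
        # Reject any level outside the documented choices.
        for name in ("mean_level", "std_level"):
            value = getattr(self, name)
            if value not in _ALLOWED_LEVELS:
                raise ValueError(
                    f"{name} must be 'batch', 'group', or None, got {value!r}"
                )
        # Assumed check: group-level statistics need a positive group size.
        uses_group = "group" in (self.mean_level, self.std_level)
        if uses_group and self.group_size < 1:
            raise ValueError("group_size must be >= 1 for group-level normalization")
```

Failing at construction time means a misconfigured run stops before any rollout or training step is spent.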

Title: Normalization Logic Flow

Batch-level normalization computes statistics across the entire batch.

Group-level normalization computes statistics within groups of group_size samples. This is the core mechanism for GRPO (Group Relative Policy Optimization) areal/api/cli_args.py91-96
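
To make the difference between the two levels concrete, here is a minimal pure-Python sketch. It is not the `Normalization` class from areal/utils/data.py: the function name `normalize`, the contiguous group layout, and the choice to compute the standard deviation from the uncentered values are assumptions, and leave-one-out averaging is omitted.

```python
import math
from typing import List, Optional, Sequence


def _chunks(n: int, level: Optional[str], group_size: int) -> List[range]:
    """Index ranges that share one set of statistics."""
    if level == "group":
        if n % group_size != 0:
            raise ValueError("batch size must be a multiple of group_size")
        # Assumption: samples of the same group are stored contiguously.
        return [range(i, i + group_size) for i in range(0, n, group_size)]
    return [range(n)]  # "batch": a single chunk covering every sample


def normalize(
    values: Sequence[float],
    mean_level: Optional[str] = "batch",
    std_level: Optional[str] = "batch",
    group_size: int = 1,
    eps: float = 1e-5,
    std_unbiased: bool = True,
) -> List[float]:
    n = len(values)
    out = [float(v) for v in values]
    if mean_level is not None:
        # Subtract the mean of each chunk from its members.
        for idx in _chunks(n, mean_level, group_size):
            mu = sum(float(values[i]) for i in idx) / len(idx)
            for i in idx:
                out[i] -= mu
    if std_level is not None:
        # Divide by the standard deviation of each chunk (plus eps).
        for idx in _chunks(n, std_level, group_size):
            xs = [float(values[i]) for i in idx]
            mu = sum(xs) / len(xs)
            denom = len(xs) - 1 if (std_unbiased and len(xs) > 1) else len(xs)
            sd = math.sqrt(sum((x - mu) ** 2 for x in xs) / denom)
            for i in idx:
                out[i] /= sd + eps
    return out
```

With rewards `[1.0, 0.0, 1.0, 1.0]` and `group_size=2`, group-level mean subtraction centers each pair independently, so the second pair (both correct) receives zero signal, whereas batch-level centering still assigns it a positive value.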

Sources: areal/api/cli_args.py42-96 areal/trainer/ppo/actor.py55-58 areal/trainer/ppo/actor.py29

PPOActorConfig: Algorithm-Specific Parameters

PPOActorConfig contains parameters that dictate the behavior of the Actor during policy updates.

Core RL Parameters

Parameter	Type	Default	Description
`eps_clip`	`float`	`0.4`	PPO clipping threshold for policy ratio. areal/trainer/ppo/actor.py126
`kl_ctl`	`float`	`0.0`	KL divergence penalty coefficient. areal/trainer/ppo/actor.py52
`reward_scaling`	`float`	`1.0`	Scaling factor applied to rewards. areal/trainer/ppo/actor.py49
`discount`	`float`	`1.0`	Discount factor (gamma) for GAE. areal/trainer/ppo/actor.py60
`gae_lambda`	`float`	`0.95`	GAE lambda parameter. areal/trainer/ppo/actor.py61
`m2_threshold`	`float | None`	`None`	Second-moment trust policy optimization (M2PO) threshold. areal/trainer/ppo/actor.py66
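
As a rough illustration of where `discount`, `gae_lambda`, and `eps_clip` enter the computation, the sketch below implements textbook GAE and the textbook PPO clipped surrogate for a single trajectory and a single token. The function names are hypothetical, and this is not the code in areal/trainer/ppo/actor.py, which operates on batched tensors and also applies reward scaling, KL penalties, and masking.

```python
import math
from typing import List, Sequence


def gae(
    rewards: Sequence[float],
    values: Sequence[float],
    discount: float = 1.0,
    gae_lambda: float = 0.95,
) -> List[float]:
    """Generalized Advantage Estimation for one trajectory.

    `values` must hold len(rewards) + 1 entries; the last entry is the
    bootstrap value of the state after the final step.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have one more entry than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error, then an exponentially weighted backward sum.
        delta = rewards[t] + discount * values[t + 1] - values[t]
        running = delta + discount * gae_lambda * running
        advantages[t] = running
    return advantages


def clipped_policy_loss(
    logp: float, old_logp: float, advantage: float, eps_clip: float
) -> float:
    """Per-token PPO clipped surrogate loss (to be minimized)."""
    ratio = math.exp(logp - old_logp)
    clipped_ratio = min(max(ratio, 1.0 - eps_clip), 1.0 + eps_clip)
    return -min(ratio * advantage, clipped_ratio * advantage)
```

With `discount=1.0` and `gae_lambda=1.0` the advantage reduces to the undiscounted return minus the value baseline; lowering `gae_lambda` trades variance for bias.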

Proximal Policy Computation

AReaL supports various methods for computing the proximal policy ($\pi_{prox}$), which is critical for off-policy or decoupled settings. These methods are defined as constants in areal/utils/constants.py.

Method	Constant	Description
`RECOMPUTE`	`PROX_LOGP_METHOD_RECOMPUTE`	Recompute logprobs from current policy via forward pass. areal/utils/constants.py26
`LOGLINEAR`	`PROX_LOGP_METHOD_LOGLINEAR`	Log-linear approximation (no forward pass). areal/trainer/ppo/actor.py95
`METRICS`	`PROX_LOGP_METHOD_METRICS`	Recomputed + approximation metrics for evaluation. areal/trainer/ppo/actor.py96
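
The idea behind a log-linear approximation can be sketched as a linear interpolation, in log-probability space, between the behavior policy's log-probs and the current policy's log-probs. The function name and the interpolation weight `alpha` below are hypothetical; how AReaL actually chooses the weight is not stated on this page, so treat this only as an illustration of why no extra forward pass is needed.

```python
from typing import List, Sequence


def loglinear_prox_logp(
    behave_logp: Sequence[float],
    current_logp: Sequence[float],
    alpha: float,
) -> List[float]:
    """Approximate proximal log-probs without running the model.

    alpha = 0 returns the behavior log-probs, alpha = 1 returns the
    current-policy log-probs, and values in between interpolate linearly.
    The meaning of alpha is an assumption made for this sketch.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(behave_logp) != len(current_logp):
        raise ValueError("log-prob sequences must have equal length")
    return [b + alpha * (c - b) for b, c in zip(behave_logp, current_logp)]
```

Both inputs are already available during the update (one from rollout, one from the training forward pass), which is what makes this cheaper than the recompute method.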

Sources: areal/api/cli_args.py42-96 areal/trainer/ppo/actor.py43-128 areal/utils/constants.py25-28

On-Policy Distillation Configuration

AReaL supports on-policy distillation where a student mimics a teacher on self-generated trajectories docs/en/algorithms/distillation.md5-13

Parameter	Type	Description
`rl_loss_weight`	`float`	Weight for the GRPO/RL loss term. examples/distillation/gsm8k_grpo_distill.yaml92
`distill_loss_weight`	`float`	Weight for the Reverse KL (RKL) distillation term. examples/distillation/gsm8k_grpo_distill.yaml93

The joint loss is implemented as `loss = rl_loss_weight * loss + distill_loss_weight * rkl_penalty`, where the `loss` on the right-hand side is the RL loss before the distillation term is added. docs/en/algorithms/distillation.md77
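
The weighting can be illustrated with scalar stand-ins for the two loss terms. The helper names are hypothetical, and the reverse-KL estimator shown (mean of student log-prob minus teacher log-prob over tokens the student sampled) is a common Monte Carlo estimate assumed here for illustration; the page does not specify how `rkl_penalty` is computed.

```python
from typing import Sequence


def reverse_kl_penalty(
    student_logp: Sequence[float], teacher_logp: Sequence[float]
) -> float:
    """Monte Carlo estimate of KL(student || teacher) on student samples."""
    if len(student_logp) != len(teacher_logp) or not student_logp:
        raise ValueError("need two non-empty sequences of equal length")
    diffs = [s - t for s, t in zip(student_logp, teacher_logp)]
    return sum(diffs) / len(diffs)


def joint_loss(
    rl_loss: float,
    rkl_penalty: float,
    rl_loss_weight: float,
    distill_loss_weight: float,
) -> float:
    """Weighted sum of the RL loss and the distillation penalty."""
    return rl_loss_weight * rl_loss + distill_loss_weight * rkl_penalty
```

Setting `rl_loss_weight` to 0 leaves only the distillation term, while setting `distill_loss_weight` to 0 leaves only the RL objective.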

Sources: docs/en/algorithms/distillation.md1-78 examples/distillation/gsm8k_grpo_distill.yaml90-98

Micro-Batching Control

The MicroBatchSpec dataclass areal/api/cli_args.py99-138 defines how a training batch is split into smaller chunks.

Parameter	Type	Default	Description
`n_mbs`	`int | None`	`1`	Number of micro-batches, or the minimum count when `max_tokens_per_mb` is set. areal/api/cli_args.py103-108
`max_tokens_per_mb`	`int | None`	`None`	Maximum tokens per micro-batch. areal/api/cli_args.py115-120
`packing_algorithm`	`str`	`"ffd"`	Algorithm for MB allocation: `"ffd"` (First Fit Decreasing) or `"kk"` (Karmarkar-Karp). areal/api/cli_args.py127-139

The kk algorithm is recommended when workload balance across DP ranks is critical, as it provides better balance than ffd for variable-length sequences areal/api/cli_args.py134-136
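
For intuition about the `"ffd"` option, the following is a generic First Fit Decreasing bin-packing sketch that groups sequences into micro-batches under a token budget. The function name `ffd_pack` is hypothetical, and this is not AReaL's allocation code; in particular it ignores `n_mbs` and any cross-rank balancing.

```python
from typing import List, Sequence


def ffd_pack(seq_lens: Sequence[int], max_tokens_per_mb: int) -> List[List[int]]:
    """Group sequence indices into micro-batches with First Fit Decreasing.

    Sequences are visited from longest to shortest, and each one is placed
    into the first micro-batch that still has room for it.
    """
    order = sorted(range(len(seq_lens)), key=lambda i: seq_lens[i], reverse=True)
    bins: List[List[int]] = []  # sequence indices per micro-batch
    loads: List[int] = []       # token count per micro-batch
    for i in order:
        n = seq_lens[i]
        if n > max_tokens_per_mb:
            raise ValueError(f"sequence {i} has {n} tokens, above the budget")
        for b, load in enumerate(loads):
            if load + n <= max_tokens_per_mb:
                bins[b].append(i)
                loads[b] += n
                break
        else:
            # No existing micro-batch has room, so open a new one.
            bins.append([i])
            loads.append(n)
    return bins
```

For lengths `[5, 7, 3, 4, 2]` and a budget of 10 tokens, this yields three micro-batches with loads 10, 9, and 2. FFD aims to use few bins rather than to equalize their loads, which is consistent with the note above that `kk` is preferred when balance matters.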

Sources: areal/api/cli_args.py99-140 areal/api/cli_args.py127-138

PPOCriticConfig: Critic Network Parameters

The critic configuration manages the value function update process.

Parameter	Type	Default	Description
`eps_clip`	`float`	`0.2`	Clipping threshold for value function updates. areal/trainer/ppo/critic.py108
`ppo_n_minibatches`	`int`	`1`	Number of minibatches for critic update. areal/trainer/ppo/critic.py63
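
A common way a critic `eps_clip` is used in PPO implementations is to limit how far the new value prediction may move from the old one before the squared error is taken. The sketch below shows that standard form for a single scalar; whether areal/trainer/ppo/critic.py uses exactly this formula (including the 0.5 factor) is an assumption, and the function name is hypothetical.

```python
def clipped_value_loss(
    value: float, old_value: float, target: float, eps_clip: float
) -> float:
    """PPO-style clipped value loss for one prediction.

    The new prediction is kept within eps_clip of the old prediction, and
    the larger of the clipped and unclipped squared errors is used.
    """
    delta = min(max(value - old_value, -eps_clip), eps_clip)
    clipped_value = old_value + delta
    unclipped_err = (value - target) ** 2
    clipped_err = (clipped_value - target) ** 2
    return 0.5 * max(unclipped_err, clipped_err)
```

Taking the maximum makes the loss pessimistic: a large jump toward the target is not rewarded beyond what the clipped prediction achieves.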

Data Flow for Training

Input data is processed through the micro-batching system before being passed to algorithm-specific loss functions.

Title: Data Flow from MicroBatch to Loss (Natural Language to Code Entity Space)

Sources: areal/trainer/ppo/actor.py31-37 areal/trainer/ppo/critic.py61-73 areal/trainer/ppo/actor.py31

Summary of Algorithm Configurations

Algorithm	Key Config Pattern
PPO	`adv_norm.mean_level="batch"`, `recompute_logprob=False`
GRPO	`reward_norm.mean_level="group"`, `kl_ctl=0.0`
DAPO	`use_decoupled_loss=True`, `prox_logp_method="recompute"`
Distill	`teacher.distill_loss_weight > 0`, `teacher.rl_loss_weight`

Sources: areal/trainer/ppo/actor.py80-101 docs/en/algorithms/distillation.md52-78 examples/distillation/gsm8k_grpo_distill.yaml62-80


URL: https://deepwiki.com/inclusionAI/AReaL/2.8-algorithm-specific-configurations