
URL: https://deepwiki.com/inclusionAI/AReaL/2.8-algorithm-specific-configurations

Last indexed: 7 May 2026 (2e12c1)

Algorithm-Specific Configurations

Purpose and Scope

This page documents the configuration parameters specific to reinforcement learning algorithms supported by AReaL, including PPO, GRPO, DAPO, and their variants. These configurations control algorithm behavior such as clipping strategies, normalization, minibatch processing, reward shaping, and on-policy distillation.

For general training engine configurations (optimizers, parallelism), see 2.4 Training Engine Configurations. For dataset and data processing configurations, see 2.7 MicroBatchSpec and Data Configurations. For the overall configuration system structure, see 2.1 Configuration Overview.

Configuration Classes Overview

The algorithm configuration is primarily managed through PPOActorConfig and PPOCriticConfig, which contain nested NormConfig objects for reward and advantage normalization. These classes are defined in areal/api/cli_args.py.

Algorithm Configuration Hierarchy

[Diagram: Algorithm Configuration Hierarchy (Natural Language to Code Entity Space)]


Sources: areal/api/cli_args.py:42-96, areal/api/cli_args.py:164-210, areal/trainer/ppo/actor.py:43-68

NormConfig: Normalization Configuration

The NormConfig dataclass (areal/api/cli_args.py:42-78) controls how rewards and advantages are normalized. Normalization is handled by the Normalization class in areal/utils/data.py, which the actor imports (areal/trainer/ppo/actor.py:29).

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| mean_level | str \| None | "batch" | Mean level for normalization: "batch", "group", or None. (areal/api/cli_args.py:46-52) |
| mean_leave1out | bool | False | Whether to use a leave-one-out average for bias reduction. (areal/api/cli_args.py:53-56) |
| std_level | str \| None | "batch" | Std level for normalization: "batch", "group", or None. (areal/api/cli_args.py:57-63) |
| std_unbiased | bool | True | Whether to use unbiased standard deviation computation. (areal/api/cli_args.py:64-69) |
| eps | float | 1e-5 | Epsilon to avoid numerical issues during division. (areal/api/cli_args.py:70-75) |
| group_size | int | 1 | Group size for group-level normalization (required if either level is "group"). (areal/api/cli_args.py:76-78) |
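As a rough illustration of how these fields fit together, the sketch below mirrors the table in a small dataclass. The class name NormConfigSketch and both validation rules are assumptions made for this example; they are not copied from areal/api/cli_args.py, and the real NormConfig may check different things.

```python
from dataclasses import dataclass
from typing import Optional

_ALLOWED_LEVELS = ("batch", "group", None)


@dataclass
class NormConfigSketch:
    """Hypothetical stand-in for NormConfig; fields and defaults follow the table."""

    mean_level: Optional[str] = "batch"
    mean_leave1out: bool = False
    std_level: Optional[str] = "batch"
    std_unbiased: bool = True
    eps: float = 1e-5
    group_size: int = 1

    def __post_init__(self) -> None:
        # Assumed rule: only the three documented levels are accepted.
        for name in ("mean_level", "std_level"):
            if getattr(self, name) not in _ALLOWED_LEVELS:
                raise ValueError(f"{name} must be 'batch', 'group', or None")
        # Assumed rule: group-level statistics need a positive group size.
        uses_group = "group" in (self.mean_level, self.std_level)
        if uses_group and self.group_size < 1:
            raise ValueError("group_size must be >= 1 for group-level normalization")


# A GRPO-style setting: statistics are shared within groups of 4 samples.
grpo_norm = NormConfigSketch(mean_level="group", std_level="group", group_size=4)
```

Keeping validation in __post_init__ means a bad value fails when the config is constructed rather than in the middle of a training step.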

Normalization Implementation

The normalization settings are validated in NormConfig.__post_init__ (areal/api/cli_args.py:80-97) and applied during advantage computation in PPOActor._compute_advantages (areal/trainer/ppo/actor.py:145-173).

[Diagram: Normalization Logic Flow]


Batch-level normalization computes statistics across the entire batch.

Group-level normalization computes statistics within groups of group_size samples. This is the core mechanism for GRPO (Group Relative Policy Optimization) (areal/api/cli_args.py:91-96).
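The difference between the two levels can be sketched in plain Python. The function below is an illustrative approximation and not the Normalization class from areal/utils/data.py. It assumes that groups are consecutive chunks of group_size samples, that the leave-one-out mean excludes the sample itself, and that the standard deviation is taken around the ordinary mean; the real implementation may differ on each point.

```python
import math
from typing import List, Optional, Sequence, Tuple


def _spans(n: int, level: str, group_size: int) -> List[Tuple[int, int]]:
    """Index ranges over which statistics are shared."""
    if level == "batch":
        return [(0, n)]
    # "group": consecutive chunks of group_size samples (assumed layout)
    return [(i, min(i + group_size, n)) for i in range(0, n, group_size)]


def normalize(
    values: Sequence[float],
    mean_level: Optional[str] = "batch",
    std_level: Optional[str] = "batch",
    group_size: int = 1,
    mean_leave1out: bool = False,
    std_unbiased: bool = True,
    eps: float = 1e-5,
) -> List[float]:
    n = len(values)
    out = list(values)
    if mean_level is not None:
        centered = []
        for lo, hi in _spans(n, mean_level, group_size):
            chunk = values[lo:hi]
            total = sum(chunk)
            for v in chunk:
                if mean_leave1out and len(chunk) > 1:
                    mean = (total - v) / (len(chunk) - 1)  # exclude the sample itself
                else:
                    mean = total / len(chunk)
                centered.append(v - mean)
        out = centered
    if std_level is not None:
        scaled = []
        for lo, hi in _spans(n, std_level, group_size):
            chunk = values[lo:hi]
            mu = sum(chunk) / len(chunk)
            unbiased = std_unbiased and len(chunk) > 1
            denom = len(chunk) - 1 if unbiased else len(chunk)
            std = math.sqrt(sum((v - mu) ** 2 for v in chunk) / denom)
            scaled.extend(x / (std + eps) for x in out[lo:hi])
        out = scaled
    return out


# Two prompts with four sampled responses each (binary rewards).
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0]
group_centered = normalize(rewards, mean_level="group", std_level=None, group_size=4)
batch_centered = normalize(rewards, mean_level="batch", std_level=None)
```

With group-level centering, each response is compared only against the other responses to the same prompt, so an easy prompt with mostly correct answers does not inflate the advantage of every sample in it.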

Sources: areal/api/cli_args.py:42-96, areal/trainer/ppo/actor.py:55-58, areal/trainer/ppo/actor.py:29

PPOActorConfig: Algorithm-Specific Parameters

PPOActorConfig contains parameters that dictate the behavior of the Actor during policy updates.

Core RL Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| eps_clip | float | 0.4 | PPO clipping threshold for the policy ratio. (areal/trainer/ppo/actor.py:126) |
| kl_ctl | float | 0.0 | KL divergence penalty coefficient. (areal/trainer/ppo/actor.py:52) |
| reward_scaling | float | 1.0 | Scaling factor applied to rewards. (areal/trainer/ppo/actor.py:49) |
| discount | float | 1.0 | Discount factor (gamma) for GAE. (areal/trainer/ppo/actor.py:60) |
| gae_lambda | float | 0.95 | GAE lambda parameter. (areal/trainer/ppo/actor.py:61) |
| m2_threshold | float \| None | None | Second-Moment Trust Policy Optimization (M2PO) threshold. (areal/trainer/ppo/actor.py:66) |
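To make the roles of discount, gae_lambda, and eps_clip concrete, here is a minimal sketch of textbook Generalized Advantage Estimation and the clipped PPO surrogate for a single token. It is written in plain Python for readability; the function names are hypothetical and the code is not taken from areal/trainer/ppo/actor.py, which operates on batched tensors and may include additional terms.

```python
import math
from typing import List, Sequence


def gae_advantages(
    rewards: Sequence[float],
    values: Sequence[float],
    discount: float = 1.0,
    gae_lambda: float = 0.95,
) -> List[float]:
    """Standard GAE. `values` holds len(rewards) + 1 entries; the last one
    is the bootstrap value after the final step (0.0 for a finished episode)."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at step t.
        delta = rewards[t] + discount * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + discount * gae_lambda * running
        advantages[t] = running
    return advantages


def clipped_policy_loss(
    logp: float, old_logp: float, advantage: float, eps_clip: float
) -> float:
    """Clipped PPO surrogate for one token, returned as a loss to minimize."""
    ratio = math.exp(logp - old_logp)
    clipped_ratio = max(1.0 - eps_clip, min(1.0 + eps_clip, ratio))
    return -min(ratio * advantage, clipped_ratio * advantage)


# Sparse terminal reward, zero value estimates: lambda controls how far
# the reward is propagated back to earlier tokens.
adv = gae_advantages([0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0], discount=1.0, gae_lambda=0.5)
```

With discount=1.0 and gae_lambda=1.0 the estimator reduces to the undiscounted return minus the value baseline, which is the usual setting when no critic is trained.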

Proximal Policy Computation

AReaL supports various methods for computing the proximal policy ($\pi_{prox}$), which is critical for off-policy or decoupled settings. These methods are defined as constants in areal/utils/constants.py.

| Method | Constant | Description |
|---|---|---|
| RECOMPUTE | PROX_LOGP_METHOD_RECOMPUTE | Recompute logprobs from the current policy via a forward pass. (areal/utils/constants.py:26) |
| LOGLINEAR | PROX_LOGP_METHOD_LOGLINEAR | Log-linear approximation (no forward pass). (areal/trainer/ppo/actor.py:95) |
| METRICS | PROX_LOGP_METHOD_METRICS | Recompute, and additionally report approximation metrics for evaluation. (areal/trainer/ppo/actor.py:96) |
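The sketch below shows where the proximal log-probability enters a decoupled PPO objective: the clipping ratio is taken against $\pi_{prox}$, while a separate importance weight corrects for the behavior policy that generated the data. The interpolation helper is a guess at what a log-linear approximation could look like; its name, the alpha weight, and the way alpha would be chosen are assumptions and are not taken from areal/trainer/ppo/actor.py.

```python
import math


def loglinear_prox_logp(behav_logp: float, cur_logp: float, alpha: float) -> float:
    """Hypothetical approximation: interpolate in log space between the
    behavior policy (alpha=0) and the current policy (alpha=1), which
    avoids an extra forward pass for the proximal policy."""
    return behav_logp + alpha * (cur_logp - behav_logp)


def decoupled_ppo_token_loss(
    cur_logp: float,
    prox_logp: float,
    behav_logp: float,
    advantage: float,
    eps_clip: float,
) -> float:
    """Decoupled PPO surrogate for one token, returned as a loss."""
    # The trust region is centered on the proximal policy.
    ratio = math.exp(cur_logp - prox_logp)
    clipped_ratio = max(1.0 - eps_clip, min(1.0 + eps_clip, ratio))
    # Importance weight from the behavior policy to the proximal policy.
    behav_weight = math.exp(prox_logp - behav_logp)
    return -behav_weight * min(ratio * advantage, clipped_ratio * advantage)


# Stale sample: the behavior policy assigned a lower log-prob than the current one.
prox = loglinear_prox_logp(behav_logp=-2.0, cur_logp=-1.0, alpha=0.5)
loss = decoupled_ppo_token_loss(-1.0, prox, -2.0, advantage=1.0, eps_clip=0.2)
```

When the three log-probabilities coincide (fully on-policy data), both the ratio and the behavior weight equal 1 and the expression reduces to the ordinary PPO surrogate.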

Sources: areal/api/cli_args.py:42-96, areal/trainer/ppo/actor.py:43-128, areal/utils/constants.py:25-28

On-Policy Distillation Configuration

AReaL supports on-policy distillation, in which a student model mimics a teacher on the student's own generated trajectories (docs/en/algorithms/distillation.md:5-13).

| Parameter | Type | Description |
|---|---|---|
| rl_loss_weight | float | Weight for the GRPO/RL loss term. (examples/distillation/gsm8k_grpo_distill.yaml:92) |
| distill_loss_weight | float | Weight for the Reverse KL (RKL) distillation term. (examples/distillation/gsm8k_grpo_distill.yaml:93) |

The joint loss is implemented as loss = rl_loss_weight * loss + distill_loss_weight * rkl_penalty, where loss on the right-hand side is the RL loss before the distillation term is added (docs/en/algorithms/distillation.md:77).
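The weighting can be illustrated with a small sketch. The reverse-KL estimator here is the simple Monte Carlo form, the mean of log p_student minus log p_teacher over tokens sampled from the student; whether AReaL uses this exact estimator is an assumption, and both function names are hypothetical.

```python
from typing import Sequence


def reverse_kl_estimate(
    student_logps: Sequence[float], teacher_logps: Sequence[float]
) -> float:
    """Monte Carlo estimate of KL(student || teacher) using the log-probs
    both models assign to tokens that the student sampled."""
    diffs = [s - t for s, t in zip(student_logps, teacher_logps)]
    return sum(diffs) / len(diffs)


def joint_loss(
    rl_loss: float,
    rkl_penalty: float,
    rl_loss_weight: float,
    distill_loss_weight: float,
) -> float:
    """Weighted sum of the RL loss and the distillation penalty."""
    return rl_loss_weight * rl_loss + distill_loss_weight * rkl_penalty


# The student is more confident than the teacher on both sampled tokens.
rkl = reverse_kl_estimate([-1.0, -2.0], [-1.5, -2.5])
total = joint_loss(rl_loss=0.2, rkl_penalty=rkl, rl_loss_weight=1.0, distill_loss_weight=0.5)
```

Setting rl_loss_weight to 0 leaves only the distillation term, while setting distill_loss_weight to 0 recovers the plain RL objective.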

Sources: docs/en/algorithms/distillation.md:1-78, examples/distillation/gsm8k_grpo_distill.yaml:90-98

Micro-Batching Control

The MicroBatchSpec dataclass (areal/api/cli_args.py:99-138) defines how a training batch is split into smaller chunks.

| Parameter | Type | Default | Description |
|---|---|---|---|
| n_mbs | int \| None | 1 | Number of micro-batches, or the minimum count when max_tokens_per_mb is set. (areal/api/cli_args.py:103-108) |
| max_tokens_per_mb | int \| None | None | Maximum tokens per micro-batch. (areal/api/cli_args.py:115-120) |
| packing_algorithm | str | "ffd" | Algorithm for micro-batch allocation: "ffd" (First Fit Decreasing) or "kk" (Karmarkar-Karp). (areal/api/cli_args.py:127-139) |

The kk algorithm is recommended when workload balance across DP ranks is critical, as it provides better balance than ffd for variable-length sequences (areal/api/cli_args.py:134-136).
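For readers unfamiliar with First Fit Decreasing, the following sketch packs sequences into micro-batches under a token budget. It is the generic bin-packing heuristic, not AReaL's allocator: the function name is hypothetical, and handling of n_mbs as a minimum count is left out.

```python
from typing import List, Sequence


def ffd_pack(seq_lens: Sequence[int], max_tokens_per_mb: int) -> List[List[int]]:
    """Return micro-batches as lists of sequence indices, using First Fit
    Decreasing: visit sequences from longest to shortest and place each one
    in the first micro-batch that still has room."""
    order = sorted(range(len(seq_lens)), key=lambda i: seq_lens[i], reverse=True)
    bins: List[List[int]] = []
    loads: List[int] = []
    for i in order:
        length = seq_lens[i]
        if length > max_tokens_per_mb:
            raise ValueError(f"sequence {i} exceeds max_tokens_per_mb")
        for b, load in enumerate(loads):
            if load + length <= max_tokens_per_mb:
                bins[b].append(i)
                loads[b] += length
                break
        else:
            # No existing micro-batch has room: open a new one.
            bins.append([i])
            loads.append(length)
    return bins


micro_batches = ffd_pack([5, 3, 8, 2, 7, 4], max_tokens_per_mb=10)
token_loads = [sum([5, 3, 8, 2, 7, 4][i] for i in mb) for mb in micro_batches]
```

FFD keeps every micro-batch under the token limit but does not try to equalize loads. Karmarkar-Karp differencing instead targets a balanced partition into a fixed number of parts, which matches the recommendation to use it when per-rank balance matters.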

Sources: areal/api/cli_args.py:99-140, areal/api/cli_args.py:127-138

PPOCriticConfig: Critic Network Parameters

The critic configuration manages the value function update process.

| Parameter | Type | Default | Description |
|---|---|---|---|
| eps_clip | float | 0.2 | Clipping threshold for value function updates. (areal/trainer/ppo/critic.py:108) |
| ppo_n_minibatches | int | 1 | Number of minibatches for the critic update. (areal/trainer/ppo/critic.py:63) |
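A common way these two parameters are used in PPO critics is sketched below: the new value prediction is kept within eps_clip of the old one, the larger of the clipped and unclipped squared errors is taken, and the batch is split into ppo_n_minibatches chunks for successive updates. This is the widely used formulation and is an assumption about areal/trainer/ppo/critic.py; the 0.5 factor, the contiguous split, and both function names are illustrative choices.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def clipped_value_loss(
    value: float, old_value: float, target_return: float, eps_clip: float
) -> float:
    """Pessimistic (max) of the clipped and unclipped squared errors."""
    delta = max(-eps_clip, min(eps_clip, value - old_value))
    value_clipped = old_value + delta
    unclipped = (value - target_return) ** 2
    clipped = (value_clipped - target_return) ** 2
    return 0.5 * max(unclipped, clipped)


def split_minibatches(items: Sequence[T], ppo_n_minibatches: int) -> List[List[T]]:
    """Split a batch into contiguous chunks whose sizes differ by at most one."""
    size, remainder = divmod(len(items), ppo_n_minibatches)
    chunks: List[List[T]] = []
    start = 0
    for k in range(ppo_n_minibatches):
        end = start + size + (1 if k < remainder else 0)
        chunks.append(list(items[start:end]))
        start = end
    return chunks


# The prediction jumped from 0.0 to the target 1.0, but only 0.2 of that
# move is credited, so the loss stays positive.
loss = clipped_value_loss(value=1.0, old_value=0.0, target_return=1.0, eps_clip=0.2)
chunks = split_minibatches([0, 1, 2, 3, 4], ppo_n_minibatches=2)
```

Taking the maximum of the two errors discourages the critic from moving far from its previous estimate in a single update, mirroring the trust region that the actor-side clipping provides.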

Data Flow for Training

Input data is processed through the micro-batching system before being passed to algorithm-specific loss functions.

[Diagram: Data Flow from MicroBatch to Loss (Natural Language to Code Entity Space)]


Sources: areal/trainer/ppo/actor.py:31-37, areal/trainer/ppo/critic.py:61-73, areal/trainer/ppo/actor.py:31

Summary of Algorithm Configurations

| Algorithm | Key Config Pattern |
|---|---|
| PPO | adv_norm.mean_level="batch", recompute_logprob=False |
| GRPO | reward_norm.mean_level="group", kl_ctl=0.0 |
| DAPO | use_decoupled_loss=True, prox_logp_method="recompute" |
| Distill | teacher.distill_loss_weight > 0, teacher.rl_loss_weight |

Sources: areal/trainer/ppo/actor.py:80-101, docs/en/algorithms/distillation.md:52-78, examples/distillation/gsm8k_grpo_distill.yaml:62-80