Last indexed: 7 May 2026 (2e12c1)

RL Algorithms

This page gives an overview of the reinforcement learning algorithms supported by AReaL, where they are implemented, and how to configure them. AReaL is designed around fully asynchronous RL training, and every algorithm can run in either synchronous or asynchronous mode.

Supported Algorithms

AReaL supports a suite of RL algorithms for large language model alignment and reasoning tasks, with a particular focus on group-based methods, off-policy variants, and on-policy distillation.

On-Policy and Group-Based Algorithms

Algorithm	Full Name	Key Features	Source
GRPO	Group Relative Policy Optimization	Uses group-based advantage normalization; no value network required.	areal/trainer/ppo/actor.py:68-69
PPO	Proximal Policy Optimization	Clipped surrogate objective with an actor-critic setup.	areal/trainer/ppo/actor.py:81-84
DAPO	Decoupled Clip and Dynamic Sampling Policy Optimization	Decoupled clip ranges and dynamic sampling to improve training efficiency.	areal/trainer/ppo/actor.py:68-69
GSPO	Group Sequence Policy Optimization	Sequence-level importance ratios and clipping instead of token-level ones.	areal/trainer/ppo/actor.py:68-69
RLOO	REINFORCE Leave-One-Out	Variance reduction via a leave-one-out baseline.	areal/trainer/rl_trainer.py:28
M2PO	Second-Moment Trust Policy Optimization	Constrains the second moment of importance weights for stable off-policy training.	areal/trainer/ppo/actor.py:66
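
The group-based baselines in the table can be illustrated with a short sketch. It assumes scalar rewards for several responses sampled from the same prompt; the function names, the `eps` term, and the use of the population standard deviation are illustrative choices, not taken from AReaL's actor code.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Assumption: population std plus a small eps; the real code may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def rloo_advantages(rewards):
    """Subtract from each reward the mean of the other rewards in the group."""
    n = len(rewards)
    if n < 2:
        raise ValueError("leave-one-out needs at least two samples per group")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


# Four sampled responses to one prompt, two of them rewarded.
group_rewards = [1.0, 0.0, 0.0, 1.0]
grpo_adv = grpo_advantages(group_rewards)
rloo_adv = rloo_advantages(group_rewards)
```

Neither baseline needs a learned value network: the other samples in the group play that role.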

Supervised, Preference, and Reward Learning

Algorithm	Description	Implementation	Source
SFT	Supervised Fine-Tuning	Standard supervised training logic via `SFTTrainer`.	areal/trainer/sft_trainer.py:54
DPO	Direct Preference Optimization	Optimizes the policy directly from preferences without an explicit reward model.	areal/trainer/dpo_trainer.py:84
RW	Bradley-Terry Reward Modeling	Preference-based reward model training via `RWTrainer`.	areal/trainer/rw_trainer.py:76
Distillation	On-Policy Distillation	Minimizes reverse KL (RKL) using teacher guidance on student-sampled trajectories.	docs/en/algorithms/distillation.md:30-39
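
As a rough illustration of the two preference objectives in the table, the sketch below writes the textbook Bradley-Terry and DPO losses for a single chosen/rejected pair. It assumes sequence-level scores and summed log-probabilities are already available; the function names and the default `beta` are illustrative, and AReaL's `RWTrainer` and `DPOTrainer` may organize the computation differently.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def bradley_terry_loss(score_chosen, score_rejected):
    """Negative log-likelihood that the chosen response outranks the rejected one."""
    return -log_sigmoid(score_chosen - score_rejected)


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair; beta=0.1 is an illustrative default."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


# A policy identical to the reference cannot separate the pair yet.
untrained = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# A policy that raised the chosen log-prob relative to the reference does better.
improved = dpo_loss(-10.0, -15.0, -12.0, -15.0)
```

In the DPO form, the log-probability ratio against the reference policy acts as an implicit reward, which is why no separate reward model is trained.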

Sources: areal/trainer/rl_trainer.py:105-108, areal/trainer/sft_trainer.py:54-57, areal/trainer/rw_trainer.py:76-80, areal/trainer/dpo_trainer.py:84-87, docs/en/algorithms/distillation.md:1-13

Algorithm Architecture

Trainer Orchestration

The PPOTrainer (and related trainers like SFTTrainer, DPOTrainer, and RWTrainer) orchestrates the lifecycle of model components. In AReaL, the Actor (policy) and Critic (value function) are managed by specialized classes that interface with the underlying TrainEngine.

Logic to Code Mapping: Trainer Components

Diagram (not reproduced here): maps the conceptual RL roles to the specific classes and files that implement them.

Sources: areal/trainer/rl_trainer.py:105-111, areal/trainer/sft_trainer.py:54-60, areal/trainer/dpo_trainer.py:84-90, areal/trainer/rw_trainer.py:76-82, areal/infra/controller/train_controller.py:174-194, areal/trainer/sft/lm_engine.py:52-53

Configuration Hierarchy

Algorithms are configured using specialized dataclasses like PPOActorConfig and PPOCriticConfig. These control critical parameters such as KL regularization, advantage normalization, and proximal policy approximation methods.
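
To make the dataclass-based approach concrete, here is a minimal sketch of how a parsed YAML mapping could be validated into a config object. `ToyActorConfig`, its field names, and the `"recompute"`/`"loglinear"` string values are hypothetical stand-ins; the actual fields of `PPOActorConfig` and `PPOCriticConfig` are not shown on this page.

```python
from dataclasses import dataclass, fields


@dataclass
class ToyActorConfig:
    # Hypothetical fields standing in for KL regularization, advantage
    # normalization, and the proximal log-prob method.
    kl_coef: float = 0.0
    normalize_advantages: bool = True
    prox_logp_method: str = "recompute"

    def __post_init__(self):
        if self.kl_coef < 0:
            raise ValueError("kl_coef must be non-negative")
        if self.prox_logp_method not in {"recompute", "loglinear"}:
            raise ValueError(f"unknown prox_logp_method: {self.prox_logp_method}")


def from_mapping(cls, raw):
    """Build a config from a dict (e.g. parsed YAML), rejecting unknown keys."""
    known = {f.name for f in fields(cls)}
    unknown = set(raw) - known
    if unknown:
        raise KeyError(f"unknown config keys: {sorted(unknown)}")
    return cls(**raw)


cfg = from_mapping(ToyActorConfig, {"kl_coef": 0.05, "prox_logp_method": "loglinear"})
```

Rejecting unknown keys at load time surfaces typos in a YAML file before any training resources are allocated.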

Configuration to Code Mapping

Diagram (not reproduced here): shows how YAML-based configurations are ingested into the actor and engine logic.

Sources: areal/trainer/rl_trainer.py:117-120, areal/trainer/sft_trainer.py:65-68, areal/trainer/dpo_trainer.py:95-98, areal/api/cli_args.py:27-29, examples/distillation/gsm8k_grpo_distill.yaml:93

Implementation Features

Decoupled and Off-Policy Training

AReaL implements a "Decoupled PPO" mode, which is essential to its asynchronous architecture. In this mode, the behavior policy ($\pi_{behave}$) used for rollouts is distinct from the proximal policy ($\pi_{prox}$) used for the surrogate objective (areal/trainer/ppo/actor.py:89-91).

Proximal Approximation: Supports multiple methods for computing $\pi_{prox}$, including PROX_LOGP_METHOD_RECOMPUTE (forward pass) and PROX_LOGP_METHOD_LOGLINEAR (log-linear approximation) (areal/trainer/ppo/actor.py:93-101).
On-Policy Distillation: Enables student models to learn from teacher models via a joint loss strategy that combines GRPO with a reverse KL penalty (docs/en/algorithms/distillation.md:59-66).
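
The decoupled objective can be sketched per token as follows. This is a minimal illustration, assuming the usual formulation in which clipping is centered on $\pi_{prox}$ and the result is reweighted by $\pi_{prox}/\pi_{behave}$; the function name, argument names, and `clip_eps` default are illustrative and not taken from `actor.py`.

```python
import math


def decoupled_ppo_token_loss(logp_new, logp_prox, logp_behave,
                             advantage, clip_eps=0.2):
    """Per-token decoupled PPO loss (to be minimized).

    logp_new:    log-prob under the policy being optimized
    logp_prox:   log-prob under the proximal policy (center of the clip range)
    logp_behave: log-prob under the behavior policy that produced the rollout
    """
    # Trust-region ratio is measured against the proximal policy.
    ratio = math.exp(logp_new - logp_prox)
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    surrogate = min(ratio * advantage, clipped * advantage)
    # Importance weight correcting for stale rollout data.
    behave_weight = math.exp(logp_prox - logp_behave)
    return -behave_weight * surrogate


on_policy = decoupled_ppo_token_loss(-1.0, -1.0, -1.0, advantage=1.0)
clipped_case = decoupled_ppo_token_loss(-1.0 + math.log(2.0), -1.0, -1.0, advantage=1.0)
stale_case = decoupled_ppo_token_loss(-1.0, -1.0, -1.0 + math.log(2.0), advantage=1.0)
```

When the behavior and proximal policies coincide, the weight is 1 and the expression reduces to the standard PPO clipped loss.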

Data Handling and Collation

Different algorithms require specific data formatting, handled by custom collation functions to prepare batches for the TrainEngine.

DPO Collation: The dpo_modeling_collate_fn prepares pairs of chosen and rejected sequences, each with its own loss_mask (areal/trainer/dpo_trainer.py:54-81).
RW Collation: The rw_modeling_collate_fn similarly prepares pairs, but specifically for reward modeling (areal/trainer/rw_trainer.py:54-73).
Token Denominator Inference: The infer_token_denominator utility ensures accurate normalization by identifying valid tokens from attention masks or sequence lengths (areal/trainer/ppo/stats.py:7-17).
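
The collation and token-counting ideas above can be sketched in plain Python. The helper names, the dictionary keys (`prompt_ids`, `chosen_ids`, `rejected_ids`, `seq_lens`), and the choice to mask out prompt tokens are assumptions made for illustration; they are not the signatures of `dpo_modeling_collate_fn` or `infer_token_denominator`.

```python
def collate_preference_pairs(pairs, pad_id=0):
    """Interleave chosen/rejected sequences and build a per-sequence loss mask.

    Assumption: only response tokens contribute to the loss.
    """
    sequences, masks = [], []
    for pair in pairs:
        for answer_key in ("chosen_ids", "rejected_ids"):
            prompt, answer = pair["prompt_ids"], pair[answer_key]
            sequences.append(prompt + answer)
            masks.append([0] * len(prompt) + [1] * len(answer))
    max_len = max(len(seq) for seq in sequences)
    return {
        "input_ids": [s + [pad_id] * (max_len - len(s)) for s in sequences],
        "attention_mask": [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences],
        "loss_mask": [m + [0] * (max_len - len(m)) for m in masks],
    }


def count_valid_tokens(batch):
    """Count non-padding tokens from an attention mask or from sequence lengths."""
    if "attention_mask" in batch:
        return sum(sum(row) for row in batch["attention_mask"])
    if "seq_lens" in batch:
        return sum(batch["seq_lens"])
    raise KeyError("batch has neither 'attention_mask' nor 'seq_lens'")


batch = collate_preference_pairs(
    [{"prompt_ids": [1, 2], "chosen_ids": [3, 4, 5], "rejected_ids": [6]}]
)
num_tokens = count_valid_tokens(batch)
```

Dividing a summed loss by a count of valid tokens, rather than by the padded tensor size, keeps the normalization independent of how much padding a batch happens to contain.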

Distributed Training Orchestration

The TrainController manages the lifecycle of distributed workers across various backends (FSDP, Megatron, Archon).

Data Dispatch: _dispatch_tensors partitions trajectories across data-parallel (DP) groups so that token counts are balanced (areal/infra/controller/train_controller.py:76-122).
Balanced Partitioning: Uses balanced_greedy_partition so that workers receive approximately equal computational loads even when sequence lengths vary (areal/infra/controller/train_controller.py:103).
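
One common way to balance token counts is a longest-first greedy heuristic, sketched below. This is an illustration of the general idea only; the function name and behavior are assumptions, and AReaL's balanced_greedy_partition may apply additional constraints (for example on the number of sequences per group).

```python
import heapq


def greedy_balanced_partition(token_counts, num_groups):
    """Assign sequence indices to groups, longest first, always to the lightest group."""
    if num_groups <= 0:
        raise ValueError("num_groups must be positive")
    # Min-heap of (current token load, group index).
    heap = [(0, g) for g in range(num_groups)]
    heapq.heapify(heap)
    groups = [[] for _ in range(num_groups)]
    order = sorted(range(len(token_counts)),
                   key=lambda i: token_counts[i], reverse=True)
    for idx in order:
        load, g = heapq.heappop(heap)
        groups[g].append(idx)
        heapq.heappush(heap, (load + token_counts[idx], g))
    return groups


token_counts = [900, 100, 500, 400, 300, 200]
groups = greedy_balanced_partition(token_counts, num_groups=2)
loads = [sum(token_counts[i] for i in group) for group in groups]
```

Placing the longest sequences first leaves the short ones to even out the remaining imbalance, which is why the heuristic works well when sequence lengths vary widely.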

Sources: areal/infra/controller/train_controller.py:189-204, areal/trainer/dpo_trainer.py:54-81, areal/trainer/rw_trainer.py:54-73, areal/trainer/ppo/stats.py:7-17, docs/en/algorithms/distillation.md:59-78

URL: https://deepwiki.com/inclusionAI/AReaL/7-rl-algorithms
