URL: https://deepwiki.com/inclusionAI/AReaL/7-rl-algorithms

Last indexed: 7 May 2026 (2e12c1)

RL Algorithms

This page gives an overview of the reinforcement learning algorithms supported by AReaL, where they are implemented, and how to configure them. AReaL is designed around fully asynchronous RL, but every algorithm can run in either synchronous or asynchronous execution mode.

Supported Algorithms

AReaL supports a suite of RL algorithms for large language model alignment and reasoning tasks, with a particular focus on group-based methods, off-policy variants, and on-policy distillation.

On-Policy and Group-Based Algorithms

| Algorithm | Description | Key Features |
|---|---|---|
| GRPO | Group Relative Policy Optimization | Uses group-based advantage normalization; no value network required. (areal/trainer/ppo/actor.py:68-69) |
| PPO | Proximal Policy Optimization | Clipped surrogate objective with actor-critic setup. (areal/trainer/ppo/actor.py:81-84) |
| DAPO | Decoupled Clip and Dynamic sAmpling Policy Optimization | Decoupled clipping bounds and dynamic sampling for training efficiency. (areal/trainer/ppo/actor.py:68-69) |
| GSPO | Group Sequence Policy Optimization | Sequence-level importance ratios and clipping. (areal/trainer/ppo/actor.py:68-69) |
| RLOO | REINFORCE Leave-One-Out | Variance reduction via a leave-one-out baseline. (areal/trainer/rl_trainer.py:28) |
| M2PO | Second-Moment Trust Policy Optimization | Constrains the second moment of the importance weights for stable off-policy training. (areal/trainer/ppo/actor.py:66) |
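
The group-based baselines in the table can be illustrated with a minimal sketch. It is not AReaL's implementation: the function names are hypothetical, and the use of the population standard deviation plus a small `eps` is an assumption made for illustration. Each list holds the scalar rewards of several responses sampled for the same prompt.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: standardize rewards within one group.

    Hypothetical helper. Population std and eps are illustrative choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def rloo_advantages(rewards):
    """Leave-one-out advantages: each reward minus the mean of the others."""
    n = len(rewards)
    if n < 2:
        raise ValueError("leave-one-out needs at least two samples per group")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


if __name__ == "__main__":
    group = [1.0, 0.0, 0.0, 1.0]  # e.g. correctness rewards for 4 samples
    print(grpo_advantages(group))
    print(rloo_advantages(group))
```

Neither baseline needs a learned value network, which is why these methods can skip the critic that PPO requires.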

Supervised, Preference, and Reward Learning

| Algorithm | Description | Implementation |
|---|---|---|
| SFT | Supervised Fine-Tuning | Standard supervised training logic via SFTTrainer. (areal/trainer/sft_trainer.py:54) |
| DPO | Direct Preference Optimization | Optimizes the policy directly from preferences without an explicit reward model. (areal/trainer/dpo_trainer.py:84) |
| RW | Bradley-Terry Reward Modeling | Preference-based reward model training via RWTrainer. (areal/trainer/rw_trainer.py:76) |
| Distillation | On-Policy Distillation | Minimizes Reverse KL (RKL) using teacher guidance on student-sampled trajectories. (docs/en/algorithms/distillation.md:30-39) |

Sources: areal/trainer/rl_trainer.py:105-108, areal/trainer/sft_trainer.py:54-57, areal/trainer/rw_trainer.py:76-80, areal/trainer/dpo_trainer.py:84-87, docs/en/algorithms/distillation.md:1-13
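
The preference and distillation objectives named above can be written out in textbook form. The sketch below is not taken from AReaL's trainers: the function names, the `beta` default, and the per-token averaging in the reverse KL estimate are assumptions for illustration. Inputs are summed sequence log-probabilities (DPO), scalar reward scores (Bradley-Terry), and per-token log-probabilities of student-sampled tokens (distillation).

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def bradley_terry_loss(score_chosen, score_rejected):
    """Reward-model loss: -log sigmoid(r_chosen - r_rejected)."""
    return -log_sigmoid(score_chosen - score_rejected)


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss on one pair, using log-probs under policy and reference."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


def reverse_kl_estimate(student_logps, teacher_logps):
    """Monte Carlo estimate of KL(student || teacher) on student samples.

    Averages log pi_student - log pi_teacher over the sampled tokens.
    """
    diffs = [s - t for s, t in zip(student_logps, teacher_logps)]
    return sum(diffs) / len(diffs)
```

When the policy equals the reference, the DPO margin is zero and the loss is log 2, the same value the Bradley-Terry loss takes for two equal scores.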

Algorithm Architecture

Trainer Orchestration

The PPOTrainer (and related trainers like SFTTrainer, DPOTrainer, and RWTrainer) orchestrates the lifecycle of model components. In AReaL, the Actor (policy) and Critic (value function) are managed by specialized classes that interface with the underlying TrainEngine.

Logic to Code Mapping: Trainer Components

The following diagram bridges the conceptual RL roles to the specific classes and files that implement them.


Sources: areal/trainer/rl_trainer.py:105-111, areal/trainer/sft_trainer.py:54-60, areal/trainer/dpo_trainer.py:84-90, areal/trainer/rw_trainer.py:76-82, areal/infra/controller/train_controller.py:174-194, areal/trainer/sft/lm_engine.py:52-53

Configuration Hierarchy

Algorithms are configured using specialized dataclasses like PPOActorConfig and PPOCriticConfig. These control critical parameters such as KL regularization, advantage normalization, and proximal policy approximation methods.

Configuration to Code Mapping

This diagram shows how YAML-based configurations are ingested into the actor and engine logic.


Sources: areal/trainer/rl_trainer.py:117-120, areal/trainer/sft_trainer.py:65-68, areal/trainer/dpo_trainer.py:95-98, areal/api/cli_args.py:27-29, examples/distillation/gsm8k_grpo_distill.yaml:93
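
As a rough illustration of how nested, YAML-style options can map onto configuration dataclasses, consider the sketch below. Every class and field name in it (`ActorConfigSketch`, `eps_clip`, `kl_ctl`, and so on) is hypothetical and chosen only to mirror the parameters mentioned above; the real schema lives in areal/api/cli_args.py and may differ.

```python
from dataclasses import dataclass, field, fields, is_dataclass


@dataclass
class ActorConfigSketch:  # hypothetical stand-in, not PPOActorConfig
    eps_clip: float = 0.2            # clipping range of the surrogate
    kl_ctl: float = 0.0              # KL regularization coefficient
    adv_norm: bool = True            # whether to normalize advantages
    prox_logp_method: str = "recompute"


@dataclass
class CriticConfigSketch:  # hypothetical stand-in, not PPOCriticConfig
    eps_clip: float = 0.5


@dataclass
class ExperimentConfigSketch:
    actor: ActorConfigSketch = field(default_factory=ActorConfigSketch)
    critic: CriticConfigSketch = field(default_factory=CriticConfigSketch)


def from_dict(cls, data):
    """Build a (possibly nested) dataclass from a parsed YAML-like dict."""
    known = {f.name: f for f in fields(cls)}
    kwargs = {}
    for key, value in data.items():
        if key not in known:
            raise KeyError(f"unknown option {key!r} for {cls.__name__}")
        ftype = known[key].type
        if isinstance(value, dict) and is_dataclass(ftype):
            value = from_dict(ftype, value)  # recurse into sub-config
        kwargs[key] = value
    return cls(**kwargs)


if __name__ == "__main__":
    parsed = {"actor": {"kl_ctl": 0.05, "prox_logp_method": "loglinear"}}
    print(from_dict(ExperimentConfigSketch, parsed))
```

Rejecting unknown keys makes a typo in a config file fail at load time rather than silently falling back to a default.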

Implementation Features

Decoupled and Off-Policy Training

AReaL implements a "Decoupled PPO" mode, which is essential for its asynchronous architecture. In this mode, the behavior policy ($\pi_{behave}$) that generated the rollouts is distinct from the proximal policy ($\pi_{prox}$) that anchors the surrogate objective (areal/trainer/ppo/actor.py:89-91).

  • Proximal Approximation: Supports multiple methods for computing $\pi_{prox}$, including PROX_LOGP_METHOD_RECOMPUTE (an extra forward pass) and PROX_LOGP_METHOD_LOGLINEAR (a log-linear approximation) (areal/trainer/ppo/actor.py:93-101).
  • On-Policy Distillation: Lets a student model learn from a teacher model through a joint loss that combines GRPO with a Reverse KL penalty (docs/en/algorithms/distillation.md:59-66).
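
A per-token sketch of a decoupled PPO objective may help make the roles of the two policies concrete. It assumes the commonly described form in which the clipped ratio is taken against $\pi_{prox}$ and the result is reweighted by $\pi_{prox} / \pi_{behave}$; the function name and signature are hypothetical, and the code does not reproduce areal/trainer/ppo/actor.py. The proximal log-probability is taken as an input, regardless of whether it came from a recompute pass or an approximation.

```python
import math


def decoupled_ppo_token_loss(logp_new, logp_prox, logp_behave,
                             advantage, eps_clip=0.2):
    """Loss for one token under a decoupled PPO objective (sketch).

    logp_new:    log-prob under the policy being optimized
    logp_prox:   log-prob under the proximal policy (trust-region anchor)
    logp_behave: log-prob under the policy that generated the rollout
    """
    # Trust-region ratio is measured against the proximal policy.
    ratio = math.exp(logp_new - logp_prox)
    clipped = max(1.0 - eps_clip, min(1.0 + eps_clip, ratio))
    surrogate = min(ratio * advantage, clipped * advantage)
    # Importance weight correcting for stale (off-policy) rollouts.
    behave_weight = math.exp(logp_prox - logp_behave)
    return -behave_weight * surrogate


if __name__ == "__main__":
    # With logp_prox == logp_behave this reduces to standard PPO.
    print(decoupled_ppo_token_loss(math.log(0.3), math.log(0.2),
                                   math.log(0.2), advantage=1.0))
```

Separating the two ratios means that stale rollout data changes only the weight of a sample, not the centre of the clipping range.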

Data Handling and Collation

Different algorithms require specific data formatting, handled by custom collation functions to prepare batches for the TrainEngine.

  • DPO Collation: The dpo_modeling_collate_fn prepares pairs of chosen and rejected sequences, each with its own loss_mask (areal/trainer/dpo_trainer.py:54-81).
  • RW Collation: The rw_modeling_collate_fn prepares similar chosen/rejected pairs for reward modeling (areal/trainer/rw_trainer.py:54-73).
  • Token Denominator Inference: The infer_token_denominator utility ensures accurate normalization by identifying valid tokens from attention masks or sequence lengths (areal/trainer/ppo/stats.py:7-17).
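
The collation and normalization steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the code behind dpo_modeling_collate_fn or infer_token_denominator: the function names, the `prompt`/`chosen`/`rejected` sample keys, the interleaved chosen-then-rejected layout, and the `seqlens` key are all hypothetical.

```python
def _pad(rows, pad_value):
    """Right-pad every row to the length of the longest row."""
    width = max(len(r) for r in rows)
    return [list(r) + [pad_value] * (width - len(r)) for r in rows]


def collate_pairs_sketch(samples, pad_token_id=0):
    """Turn preference samples into a padded batch.

    Rows are interleaved as chosen, rejected, chosen, rejected, ...
    The loss_mask is 1 only on answer tokens, never on prompt or padding.
    """
    input_ids, loss_mask = [], []
    for s in samples:
        for answer in (s["chosen"], s["rejected"]):
            input_ids.append(s["prompt"] + answer)
            loss_mask.append([0] * len(s["prompt"]) + [1] * len(answer))
    attention_mask = [[1] * len(row) for row in input_ids]
    return {
        "input_ids": _pad(input_ids, pad_token_id),
        "attention_mask": _pad(attention_mask, 0),
        "loss_mask": _pad(loss_mask, 0),
    }


def count_valid_tokens_sketch(batch):
    """Count non-padding tokens from an attention mask or from lengths."""
    if "attention_mask" in batch:
        return sum(sum(row) for row in batch["attention_mask"])
    if "seqlens" in batch:
        return sum(batch["seqlens"])
    raise KeyError("batch has neither 'attention_mask' nor 'seqlens'")


if __name__ == "__main__":
    batch = collate_pairs_sketch(
        [{"prompt": [1, 2], "chosen": [3, 4, 5], "rejected": [6]}]
    )
    print(batch["loss_mask"], count_valid_tokens_sketch(batch))
```

Dividing a summed loss by the count of valid tokens rather than by the padded tensor size keeps the normalization independent of how much padding a batch happens to contain.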

Distributed Training Orchestration

The TrainController manages the lifecycle of distributed workers across various backends (FSDP, Megatron, Archon).

Sources: areal/infra/controller/train_controller.py:189-204, areal/trainer/dpo_trainer.py:54-81, areal/trainer/rw_trainer.py:54-73, areal/trainer/ppo/stats.py:7-17, docs/en/algorithms/distillation.md:59-78

Related Documentation

  • Algorithm Overview — Survey of PPO, GRPO, DAPO, GSPO, RLOO, DPO, and other supported algorithms.
  • PPO Implementation — PPO-specific configurations, actor-critic setup, and loss computation.
  • GRPO and Variants — GRPO, DAPO, GSPO implementations and group-based optimization.
  • Trainer Orchestration — How PPOTrainer, SFTTrainer, GRPOTrainer, and DPOTrainer orchestrate training loops.
  • Asynchronous Training — Async rollout/training loop, offpolicyness checks, and version tracking.
  • Reference Model and Critic — Reference model colocation, critic training, and KL regularization.
  • Advanced Algorithms — M2PO, on-policy distillation, proximal log-probability approximation, and other advanced techniques.
  • DPO Implementation — Direct Preference Optimization (DPO) and IPO variants.