
URL: https://deepwiki.com/inclusionAI/AReaL/7.1-algorithm-overview

Last indexed: 7 May 2026 (2e12c1)

Algorithm Overview

Purpose and Scope

This page surveys the reinforcement learning (RL) algorithms supported by AReaL, their characteristics, and their configuration structure. AReaL is a fully asynchronous RL training system built for high-throughput training of large reasoning and agentic models. It decouples rollout generation from the model update phase, which keeps hardware utilization high across heterogeneous clusters.

Supported Algorithms

AReaL implements a wide range of RL algorithms optimized for language model alignment. All algorithms support both synchronous and asynchronous execution modes.

Algorithm Matrix

| Algorithm | Category | Critic Required | Key Feature | Configuration Class |
| --- | --- | --- | --- | --- |
| GRPO | On-policy | No | Group relative policy optimization | `PPOActorConfig` |
| GSPO | On-policy | No | Sequence-level importance sampling | `PPOActorConfig` |
| PPO | On-policy | Optional | Proximal policy optimization with clipping | `PPOActorConfig` + `PPOCriticConfig` |
| DAPO | On-policy | No | Dynamic batch size policy optimization | `PPOActorConfig` |
| LitePPO | On-policy | No | Lightweight PPO without value function | `PPOActorConfig` |
| RLOO | On-policy | No | REINFORCE Leave-One-Out | `PPOActorConfig` |
| SAPO | On-policy | No | Soft adaptive policy optimization | `PPOActorConfig` |
| Dr.GRPO | On-policy | No | Improved GRPO with specific norm levels | `PPOActorConfig` |
| M2PO | On-policy | No | Second-Moment Trust Policy Optimization | `PPOActorConfig` |
| DPO | Offline/Pref | No | Direct Preference Optimization | `DPOConfig` |
| Reward Modeling | Supervised | N/A | Bradley-Terry preference learning | `RWConfig` |
| SFT | Supervised | N/A | Supervised fine-tuning | `SFTConfig` |

Sources: areal/api/cli_args.py918-1309 docs/en/algorithms/grpo_series.md9-20 areal/trainer/dpo_trainer.py84-188

Algorithm Configuration Hierarchy

The following diagram bridges the abstract algorithm selection to the concrete configuration fields used in the codebase.


Sources: areal/api/cli_args.py799-1309 docs/en/algorithms/grpo_series.md45-66 areal/trainer/dpo_trainer.py84-116

Algorithm Categories

On-Policy Algorithms (PPO Family)

All on-policy algorithms in AReaL share the `PPOActorConfig` configuration structure but differ in specific parameter settings. These algorithms collect trajectories using the current policy and update the policy using those trajectories.
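
As a point of reference for the PPO family, the clipped surrogate objective that gives PPO its "clipping" can be sketched per token as follows. This is a minimal illustration of the textbook formula, not AReaL's implementation; the function name and the `eps_clip` default are assumptions chosen for the example.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, eps_clip=0.2):
    """Textbook PPO clipped objective for one token (to be maximized).

    logp_new / logp_old are log-probabilities of the sampled token under
    the current policy and the policy that generated the rollout.
    eps_clip is a hypothetical default, not a value taken from AReaL.
    """
    ratio = math.exp(logp_new - logp_old)
    # Keep the ratio inside [1 - eps_clip, 1 + eps_clip].
    clipped_ratio = max(1.0 - eps_clip, min(ratio, 1.0 + eps_clip))
    # Taking the minimum makes the objective a pessimistic bound.
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage the objective stops growing once the ratio exceeds `1 + eps_clip`, which removes the incentive to move far from the rollout policy in a single update.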

Common Configuration Parameters:

GRPO (Group Relative Policy Optimization)

GRPO optimizes the policy by comparing samples within groups. It does not require a critic model, making it computationally efficient.
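
The group comparison can be illustrated with a small sketch: each response's reward is normalized against the other responses sampled for the same prompt, so no learned value function is needed. The function name and the `eps` constant are assumptions for illustration; AReaL's actual normalization options live in its configuration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    eps is a hypothetical stabilizer that avoids division by zero when
    every response in the group receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]
```

For a group with rewards `[1, 0, 0, 1]`, the two correct responses get positive advantages and the two incorrect ones get negative advantages of equal magnitude.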

GSPO (Group Sequence Policy Optimization)

GSPO uses sequence-level importance sampling, typically employing the geometric mean of per-token ratios.
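
The geometric mean of per-token ratios is equivalent to exponentiating the mean per-token log-ratio, which is how it is usually computed in practice. The sketch below assumes per-token log-probabilities are already available as plain lists; the function name is hypothetical.

```python
import math


def sequence_importance_ratio(logp_new, logp_old):
    """Geometric mean of per-token importance ratios for one sequence.

    Computed in log space: exp(mean(logp_new - logp_old)).
    """
    if len(logp_new) != len(logp_old) or not logp_new:
        raise ValueError("need two non-empty lists of equal length")
    diffs = [n - o for n, o in zip(logp_new, logp_old)]
    return math.exp(sum(diffs) / len(diffs))
```

Working in log space avoids multiplying many small ratios together, and the length normalization keeps long and short sequences on a comparable scale.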

Dr.GRPO

Dr.GRPO is a variant of GRPO that adjusts normalization levels for better stability.
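
One commonly described form of this adjustment is to center rewards on the group mean without dividing by the group standard deviation. The sketch below shows only that idea, under the assumption that it matches the configuration AReaL uses for Dr.GRPO; the function name is hypothetical.

```python
import statistics


def mean_centered_advantages(rewards):
    """Subtract the group mean from each reward, with no std scaling.

    Assumption: this mirrors the group-level mean centering attributed
    to Dr.GRPO; other normalization settings are not modeled here.
    """
    mean = statistics.fmean(rewards)
    return [r - mean for r in rewards]
```

Without the standard-deviation term, groups whose rewards barely differ produce small advantages instead of being rescaled to unit variance.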

RLOO (REINFORCE Leave-One-Out)

RLOO estimates the baseline by averaging rewards of other sampled responses for the same prompt, excluding the current sample.
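
The leave-one-out baseline can be written in a few lines. This is a minimal sketch of the estimator described above, with a hypothetical function name, operating on the scalar rewards of all responses sampled for one prompt.

```python
def rloo_advantages(rewards):
    """Advantage of each response against the mean reward of the others."""
    k = len(rewards)
    if k < 2:
        raise ValueError("leave-one-out needs at least two responses")
    total = sum(rewards)
    # Baseline for sample i is the average over the other k - 1 samples.
    return [r - (total - r) / (k - 1) for r in rewards]
```

Because each sample is excluded from its own baseline, the baseline does not depend on the sample it is applied to.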

M2PO (Second-Moment Trust Policy Optimization)

M2PO is designed for stable off-policy training in which the rollout data may be stale. It constrains the second moment of the importance weights.
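
One way to picture a second-moment constraint is as a token mask: tokens with the largest squared log importance ratio are dropped until the mean squared log-ratio of the remaining tokens falls under a threshold. The sketch below follows that reading. It is an assumption about the mechanism rather than AReaL's code, and both the function name and the `threshold` parameter are hypothetical.

```python
def second_moment_mask(log_ratios, threshold):
    """Return a keep-mask so that mean(log_ratio**2) over kept tokens
    is at most `threshold` (illustrative sketch, not AReaL's code)."""
    keep = [True] * len(log_ratios)
    remaining = len(log_ratios)
    sq_sum = sum(x * x for x in log_ratios)
    # Visit tokens from the largest squared log-ratio to the smallest.
    order = sorted(range(len(log_ratios)),
                   key=lambda i: log_ratios[i] ** 2, reverse=True)
    for i in order:
        if remaining == 0 or sq_sum / remaining <= threshold:
            break
        keep[i] = False
        sq_sum -= log_ratios[i] ** 2
        remaining -= 1
    return keep
```

For example, with log-ratios `[0.0, 0.1, 2.0, -0.1]` and a threshold of `0.04`, only the outlier token with log-ratio `2.0` is masked out.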

Key Configuration Parameters

Algorithm Selection Logic

This diagram maps configuration flags in `PPOActorConfig` to the resulting loss behavior in the training engines.


Sources: areal/api/cli_args.py993-1032 docs/en/algorithms/grpo_series.md58-66

Reward Processing Pipeline

The `PPOActor` logic (configured via `PPOActorConfig`) manages the reward processing pipeline before loss computation.


Sources: areal/api/cli_args.py945-985 docs/en/algorithms/grpo_series.md75-97

Supervised and Preference Algorithms

DPO (Direct Preference Optimization)

DPO directly optimizes the policy using preference pairs without a separate reward model.
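
The standard DPO objective for a single preference pair can be sketched as below. Inputs are summed sequence log-probabilities under the policy and under a frozen reference model; the function names and the `beta` default are assumptions for illustration, not AReaL's `DPOConfig` fields.

```python
import math


def _softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    beta is a hypothetical default that scales the implicit reward.
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)) equals softplus(-margin).
    return _softplus(-margin)
```

When the policy matches the reference the loss equals `log 2`, and it decreases as the policy raises the chosen response's likelihood relative to the rejected one.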

SFT (Supervised Fine-Tuning)

SFT trains the model to maximize the log-likelihood of target sequences.
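
Maximizing log-likelihood is the same as minimizing the average negative log-probability of the target tokens. The sketch below assumes per-token log-probabilities and a mask that marks which positions belong to the target (for example, excluding the prompt); both names are hypothetical.

```python
def sft_loss(token_logps, loss_mask):
    """Mean negative log-likelihood over positions where loss_mask is truthy."""
    selected = [lp for lp, m in zip(token_logps, loss_mask) if m]
    if not selected:
        raise ValueError("loss_mask selects no target tokens")
    return -sum(selected) / len(selected)
```

For log-probabilities `[-0.5, -1.0, -2.0]` with mask `[0, 1, 1]`, only the last two tokens contribute, giving a loss of `1.5`.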

Reward Modeling (RW)

Reward modeling trains a reward model on preference pairs (chosen vs. rejected).
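
Under the Bradley-Terry model listed in the algorithm matrix, the probability that the chosen response is preferred is the sigmoid of the difference between the two scalar rewards, and training minimizes its negative log. A minimal sketch, with hypothetical function names:

```python
import math


def preference_probability(reward_chosen, reward_rejected):
    """Bradley-Terry probability that the chosen response is preferred."""
    # Plain sigmoid; very large negative gaps could overflow math.exp.
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))


def reward_model_loss(pairs):
    """Mean negative log-likelihood over (reward_chosen, reward_rejected) pairs."""
    losses = [-math.log(preference_probability(c, r)) for c, r in pairs]
    return sum(losses) / len(losses)
```

The loss depends only on the reward gap, so the model is free to choose any absolute reward scale.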

Asynchronous Training Support

All RL algorithms support asynchronous training via rollout-training decoupling. This is managed by `PPOTrainer`, which orchestrates the actor, critic, and reference engines (areal/trainer/rl_trainer.py105-189).
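
To make the decoupling concrete, the toy sketch below tags each rollout with the policy version that produced it and lets the trainer consume only rollouts within a staleness bound. Every name here (`Rollout`, `StalenessBoundedBuffer`, `max_staleness`) is hypothetical, and discarding stale rollouts at consumption time is an assumption for illustration; it does not describe how `PPOTrainer` handles staleness.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Rollout:
    prompt_id: int
    policy_version: int  # version of the weights that generated it


class StalenessBoundedBuffer:
    """Toy queue between rollout workers and the trainer."""

    def __init__(self, max_staleness):
        self.max_staleness = max_staleness
        self._queue = deque()

    def put(self, rollout):
        self._queue.append(rollout)

    def take_batch(self, current_version, batch_size):
        """Pop up to batch_size rollouts, discarding ones that are too stale."""
        batch = []
        while self._queue and len(batch) < batch_size:
            rollout = self._queue.popleft()
            if current_version - rollout.policy_version <= self.max_staleness:
                batch.append(rollout)
        return batch
```

Generation never blocks on training in this picture: workers keep calling `put`, while the trainer calls `take_batch` whenever it is ready for the next update.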

Key Parameters for Async Stability:

Sources: areal/trainer/rl_trainer.py105-189 areal/api/cli_args.py918-1106

Algorithm Selection Guide

Typical Use Cases

| Use Case | Recommended Algorithm | Rationale |
| --- | --- | --- |
| Math reasoning (GSM8K) | GRPO | Simple, no critic needed, efficient group-based advantages (docs/en/algorithms/grpo_series.md150-156) |
| High Stability Reasoning | Dr.GRPO | Uses group-level mean centering for reduced variance (docs/en/algorithms/grpo_series.md133-134) |
| Agentic tasks | PPO with critic | Value function helps with credit assignment in complex trajectories (areal/api/cli_args.py1109-1111) |
| Offline Preference Tuning | DPO | Stable training on static preference datasets (areal/trainer/dpo_trainer.py84-116) |

Sources: docs/en/algorithms/grpo_series.md areal/api/cli_args.py areal/trainer/dpo_trainer.py

Core Implementation Entities

PPOTrainer

The primary orchestrator for PPO-family algorithms, managing engine creation and the training loop (areal/trainer/rl_trainer.py105-200).

SFTTrainer

Manages the supervised fine-tuning lifecycle (areal/trainer/sft_trainer.py54-147).

RWTrainer / DPOTrainer

Orchestrate preference-based training loops for reward models or direct policy optimization (areal/trainer/rw_trainer.py76-180, areal/trainer/dpo_trainer.py84-147).

TrainController

Handles the dispatch of tensors and data splitting across data-parallel groups during training (areal/infra/controller/train_controller.py174-205).
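
The data-splitting idea can be illustrated with a simple contiguous split of a batch into one shard per data-parallel rank. This is a generic sketch under the assumption of an even-as-possible split; the function name is hypothetical and the real controller may balance by sequence length or other criteria.

```python
def split_for_data_parallel(samples, dp_size):
    """Split samples into dp_size contiguous shards of near-equal size."""
    if dp_size < 1:
        raise ValueError("dp_size must be at least 1")
    base, extra = divmod(len(samples), dp_size)
    shards, start = [], 0
    for rank in range(dp_size):
        # The first `extra` ranks take one additional sample each.
        size = base + (1 if rank < extra else 0)
        shards.append(samples[start:start + size])
        start += size
    return shards
```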

Sources: areal/trainer/rl_trainer.py areal/trainer/sft_trainer.py areal/trainer/rw_trainer.py areal/trainer/dpo_trainer.py areal/infra/controller/train_controller.py