Last indexed: 7 May 2026 (2e12c1)

Algorithm Overview

Purpose and Scope

This page surveys the reinforcement learning (RL) algorithms supported by AReaL, their characteristics, and their configuration structure. AReaL is a fully asynchronous RL training system built for high-throughput training of large reasoning and agentic models. It decouples rollout generation from the model update phase, which keeps hardware utilization high across heterogeneous clusters.

Supported Algorithms

AReaL implements a range of RL algorithms for language model alignment, along with supervised and preference-based training. All RL algorithms support both synchronous and asynchronous execution modes.

Algorithm Matrix

Algorithm	Category	Critic Required	Key Feature	Configuration Class
GRPO	On-policy	No	Group relative policy optimization	`PPOActorConfig`
GSPO	On-policy	No	Sequence-level importance sampling	`PPOActorConfig`
PPO	On-policy	Optional	Proximal policy optimization with clipping	`PPOActorConfig` + `PPOCriticConfig`
DAPO	On-policy	No	Decoupled clipping and dynamic sampling policy optimization	`PPOActorConfig`
LitePPO	On-policy	No	Lightweight PPO without value function	`PPOActorConfig`
RLOO	On-policy	No	REINFORCE Leave-One-Out	`PPOActorConfig`
SAPO	On-policy	No	Soft adaptive policy optimization	`PPOActorConfig`
Dr.GRPO	On-policy	No	GRPO with group-mean centering and no standard-deviation scaling	`PPOActorConfig`
M2PO	On-policy	No	Second-Moment Trust Policy Optimization	`PPOActorConfig`
DPO	Offline/Pref	No	Direct Preference Optimization	`DPOConfig`
Reward Modeling	Supervised	N/A	Bradley-Terry preference learning	`RWConfig`
SFT	Supervised	N/A	Supervised fine-tuning	`SFTConfig`

Sources: areal/api/cli_args.py918-1309 docs/en/algorithms/grpo_series.md9-20 areal/trainer/dpo_trainer.py84-188

Algorithm Configuration Hierarchy

Diagram: mapping from the abstract algorithm selection to the concrete configuration fields used in the codebase.

Sources: areal/api/cli_args.py799-1309 docs/en/algorithms/grpo_series.md45-66 areal/trainer/dpo_trainer.py84-116

Algorithm Categories

On-Policy Algorithms (PPO Family)

All on-policy algorithms in AReaL share the `PPOActorConfig` configuration structure but differ in specific parameter settings. These algorithms collect trajectories using the current policy and update the policy using those trajectories.

Common Configuration Parameters:

eps_clip (float): Clipping factor for policy ratio, typically 0.2 areal/api/cli_args.py922-924
kl_ctl (float): KL divergence penalty coefficient areal/api/cli_args.py986-988
reward_norm (`NormConfig`): Reward normalization settings.
adv_norm (`NormConfig`): Advantage normalization settings.
ppo_n_minibatches (int): Number of minibatches per update areal/api/cli_args.py934-936
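
To make the role of eps_clip concrete, the sketch below evaluates the standard PPO clipped surrogate for a few tokens in plain Python. It illustrates the textbook objective only and is not AReaL's loss code; the function name ppo_clipped_loss and its list-based arguments are hypothetical.

```python
import math


def ppo_clipped_loss(logp_new, logp_old, advantages, eps_clip=0.2):
    """Mean PPO clipped-surrogate loss over tokens (a value to minimize).

    Hypothetical helper for illustration; inputs are per-token lists.
    """
    losses = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Policy ratio pi_new(token) / pi_old(token), computed from log-probs.
        ratio = math.exp(lp_new - lp_old)
        # Restrict the ratio to [1 - eps_clip, 1 + eps_clip].
        clipped = min(max(ratio, 1.0 - eps_clip), 1.0 + eps_clip)
        # Take the pessimistic (smaller) of the two surrogate objectives.
        losses.append(-min(ratio * adv, clipped * adv))
    return sum(losses) / len(losses)


if __name__ == "__main__":
    print(ppo_clipped_loss([0.0, 1.0], [0.0, 0.0], [1.0, 1.0]))
```

With eps_clip=0.2, a token whose ratio has grown to about 2.72 under a positive advantage contributes as if the ratio were 1.2, so the update gains nothing from pushing that token further.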

GRPO (Group Relative Policy Optimization)

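GRPO optimizes the policy by comparing the responses sampled for the same prompt within a group. It does not require a critic model, which avoids the memory and compute cost of training a value function.

Key Parameters: critic=None, reward_norm.mean_level='group' docs/en/algorithms/grpo_series.md142-145
Logic: Uses group-based advantages $\hat{A}_{i,t} = \frac{r_i - \text{mean}(r)}{\text{std}(r)}$, where $r_i$ is the reward of response $i$, $r$ is the set of rewards in its group, and every token $t$ of response $i$ receives the same advantage docs/en/algorithms/grpo_series.md162-163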
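
The group-based advantage can be sketched in a few lines of plain Python. This is a minimal illustration, not AReaL's normalization code: the function name, the use of the population standard deviation, and the small eps added to the denominator are all assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Advantage of each response relative to its own group.

    `rewards` holds one scalar reward per response sampled for the same
    prompt. `eps` is an assumed guard against a zero standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four responses to one prompt: two rewarded, two not.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no gradient signal.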

GSPO (Group Sequence Policy Optimization)

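GSPO applies importance sampling at the sequence level instead of per token. The sequence-level ratio is the geometric mean of the per-token ratios, i.e. the full-sequence likelihood ratio normalized by response length.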

Key Parameters: importance_sampling_level='sequence' docs/en/algorithms/grpo_series.md129-137
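
As a rough illustration of sequence-level importance sampling, the sketch below computes the geometric mean of per-token ratios from log-probabilities. The function name and list-based inputs are hypothetical; how AReaL applies the ratio inside the loss is not shown here.

```python
import math


def sequence_importance_ratio(logp_new, logp_old):
    """Geometric mean of per-token ratios for one response.

    Equivalent to exp(mean(logp_new - logp_old)), which is cheaper and more
    stable than multiplying the ratios and taking a root.
    """
    log_ratios = [n - o for n, o in zip(logp_new, logp_old)]
    return math.exp(sum(log_ratios) / len(log_ratios))


if __name__ == "__main__":
    # Per-token ratios of 2 and 8 have geometric mean 4.
    print(sequence_importance_ratio([math.log(2.0), math.log(8.0)], [0.0, 0.0]))
```

Because the log-ratios are averaged over the response, long and short responses produce ratios on a comparable scale.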

Dr.GRPO

A GRPO variant that centers advantages on the group mean but does not scale them by the group standard deviation.

Key Parameters: actor.adv_norm.mean_level='group', actor.adv_norm.std_level=null docs/en/algorithms/grpo_series.md45-51
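
The effect of setting the standard-deviation level to null can be sketched as mean-only centering. The helper below is hypothetical and only illustrates the arithmetic implied by the parameters above.

```python
from statistics import mean


def mean_centered_advantages(rewards):
    """Subtract the group mean from each reward, with no std scaling."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


if __name__ == "__main__":
    print(mean_centered_advantages([1.0, 0.0, 0.0, 1.0]))
    # Scaling the rewards scales the advantages by the same factor.
    print(mean_centered_advantages([10.0, 0.0, 0.0, 10.0]))
```

Unlike the std-normalized form, the magnitude of these advantages follows the spread of rewards in the group, so groups with nearly identical rewards are not amplified.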

RLOO (REINFORCE Leave-One-Out)

RLOO estimates the baseline by averaging rewards of other sampled responses for the same prompt, excluding the current sample.

Key Parameters: actor.adv_norm.mean_level='group', actor.adv_norm.mean_leave1out=true docs/en/algorithms/grpo_series.md165-168
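
A leave-one-out baseline can be sketched as follows: each response is compared with the mean reward of the other responses to the same prompt. The function name is hypothetical and the code is an illustration of the estimator, not AReaL's implementation.

```python
def leave_one_out_advantages(rewards):
    """Reward minus the mean reward of the other responses in the group."""
    n = len(rewards)
    if n < 2:
        raise ValueError("leave-one-out needs at least two responses per prompt")
    total = sum(rewards)
    # (total - r) / (n - 1) is the mean of the group with response r removed.
    return [r - (total - r) / (n - 1) for r in rewards]


if __name__ == "__main__":
    print(leave_one_out_advantages([1.0, 0.0, 0.0, 1.0]))
```

Excluding the current sample keeps the baseline independent of the response it is applied to.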

M2PO (Second-Moment Trust Policy Optimization)

M2PO is designed to keep training stable when rollout data is stale, i.e. generated by an older policy than the one being updated. It does this by constraining the second moment of the importance weights.

Key Parameter: actor.m2_threshold ($\tau_{M_2}$) areal/api/cli_args.py1003-1005
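
One way to picture a second-moment constraint is the sketch below. It rests on several assumptions that the text above does not confirm: the second moment is taken as the mean of squared per-token log-ratios, it is averaged over the tokens that remain, and tokens with the largest squared log-ratio are masked first until the moment falls to the threshold. The function name is hypothetical.

```python
def second_moment_mask(log_ratios, m2_threshold):
    """Return a keep-mask so the second moment of kept tokens <= threshold.

    Assumed reading of the constraint: M2 = mean(log_ratio ** 2) over the
    kept tokens, with the most extreme tokens dropped first.
    """
    keep = [True] * len(log_ratios)
    # Visit tokens from the largest squared log-ratio to the smallest.
    order = sorted(range(len(log_ratios)),
                   key=lambda i: log_ratios[i] ** 2, reverse=True)
    sq_sum = sum(x * x for x in log_ratios)
    kept = len(log_ratios)
    for i in order:
        if sq_sum / kept <= m2_threshold:
            break  # constraint already satisfied by the kept tokens
        keep[i] = False
        sq_sum -= log_ratios[i] ** 2
        kept -= 1
    return keep


if __name__ == "__main__":
    print(second_moment_mask([0.1, -0.1, 2.0], m2_threshold=0.04))
```

In this reading only a few outlier tokens are removed, while the bulk of the stale batch still contributes to the update.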

Key Configuration Parameters

Algorithm Selection Logic

Diagram: mapping from configuration flags in PPOActorConfig to the resulting loss behavior in the training engines.

Sources: areal/api/cli_args.py993-1032 docs/en/algorithms/grpo_series.md58-66

Reward Processing Pipeline

The PPOActor logic (configured via PPOActorConfig) manages the reward processing pipeline before loss computation.

Sources: areal/api/cli_args.py945-985 docs/en/algorithms/grpo_series.md75-97

Supervised and Preference Algorithms

DPO (Direct Preference Optimization)

DPO directly optimizes the policy using preference pairs without a separate reward model.

Orchestration: Managed by DPOTrainer areal/trainer/dpo_trainer.py84-147
Data Flow: dpo_modeling_collate_fn prepares chosen and rejected sequence pairs with their respective loss_mask areal/trainer/dpo_trainer.py54-81
Reference Model: Requires a ref configuration block for the reference policy areal/trainer/dpo_trainer.py114-116
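
For orientation, the standard DPO objective for a single preference pair can be sketched as below. It assumes each input is the summed log-probability of a response under the policy or the reference model, restricted to the tokens selected by loss_mask; the function name and the beta default are hypothetical and not taken from DPOTrainer.

```python
import math


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is a summed sequence log-probability. `beta` scales the
    implicit reward; its default here is an assumption.
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # Policy identical to the reference: loss equals log(2).
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
```

The loss falls below log(2) once the policy raises the chosen response relative to the reference more than it raises the rejected one.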

SFT (Supervised Fine-Tuning)

SFT trains the model to maximize the log-likelihood of target sequences.

Orchestration: Managed by SFTTrainer areal/trainer/sft_trainer.py54-147
Loop: Iterates through train_dataloader performing standard gradient descent areal/trainer/sft_trainer.py161-195
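
The quantity minimized in SFT can be sketched as a masked negative log-likelihood. The helper below is hypothetical; the use of a loss mask to exclude prompt tokens is an assumption of this sketch and is not stated for SFTTrainer above.

```python
def masked_nll(token_logprobs, loss_mask):
    """Mean negative log-likelihood over tokens where the mask is set."""
    selected = [-lp for lp, m in zip(token_logprobs, loss_mask) if m]
    if not selected:
        raise ValueError("loss_mask selects no tokens")
    return sum(selected) / len(selected)


if __name__ == "__main__":
    # First token is a prompt token and is excluded from the loss.
    print(masked_nll([-0.5, -1.0, -2.0], [0, 1, 1]))
```

Minimizing this value is equivalent to maximizing the log-likelihood of the target tokens.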

Reward Modeling (RW)

Reward modeling trains a reward model on preference pairs (chosen vs. rejected) using Bradley-Terry preference learning.

Orchestration: Managed by RWTrainer areal/trainer/rw_trainer.py76-180
Data Flow: rw_modeling_collate_fn produces paired tensors for chosen and rejected responses areal/trainer/rw_trainer.py54-73
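
The Bradley-Terry objective for one pair can be sketched as below, assuming the reward model emits one scalar score per response. The function name is hypothetical and the code does not reproduce RWTrainer.

```python
import math


def bradley_terry_loss(score_chosen, score_rejected):
    """Negative log-probability that the chosen response is preferred."""
    diff = score_chosen - score_rejected
    # -log(sigmoid(diff)), written in a numerically stable form.
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


if __name__ == "__main__":
    print(bradley_terry_loss(2.0, 0.0))  # correct ranking, small loss
    print(bradley_terry_loss(0.0, 2.0))  # wrong ranking, large loss
```

Only the difference between the two scores matters, so the absolute scale of the reward is left unconstrained by this loss.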

Asynchronous Training Support

All RL algorithms support asynchronous training by decoupling rollout generation from training. The PPOTrainer manages this decoupling and orchestrates the actor, critic, and reference engines areal/trainer/rl_trainer.py105-189

Key Parameters for Async Stability:

recompute_logprob: Recomputes token log-probabilities with the trainer's current weights at update time, so the policy ratio is measured against the current model rather than the values recorded during rollout areal/api/cli_args.py926-928
use_decoupled_loss: Enables off-policy correction when the behavior policy that generated the rollouts differs from the policy being trained areal/api/cli_args.py1011-1013

Sources: areal/trainer/rl_trainer.py105-189 areal/api/cli_args.py918-1106
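
One common form of off-policy correction separates the behavior policy that produced the rollout from a proximal policy that anchors the clipping. The sketch below shows that form for a single token. Whether use_decoupled_loss computes exactly this expression is an assumption, as is the idea that recomputed log-probabilities play the role of logp_proximal; all names in the code are hypothetical.

```python
import math


def decoupled_ppo_token_loss(logp_current, logp_proximal, logp_behavior,
                             advantage, eps_clip=0.2):
    """Per-token decoupled PPO loss (illustrative form only).

    ratio        : current policy vs. proximal policy, clipped as in PPO
    behav_weight : proximal policy vs. behavior policy, corrects staleness
    """
    ratio = math.exp(logp_current - logp_proximal)
    behav_weight = math.exp(logp_proximal - logp_behavior)
    clipped = min(max(ratio, 1.0 - eps_clip), 1.0 + eps_clip)
    return -behav_weight * min(ratio * advantage, clipped * advantage)


if __name__ == "__main__":
    # Rollout policy assigned half the probability the proximal policy does.
    print(decoupled_ppo_token_loss(0.0, 0.0, -math.log(2.0), 1.0))
```

When all three policies agree, the expression reduces to the ordinary PPO clipped surrogate, so synchronous training is the special case with no staleness.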

Algorithm Selection Guide

Typical Use Cases

Use Case	Recommended Algorithm	Rationale
Math reasoning (GSM8K)	GRPO	Simple, no critic needed, efficient group-based advantages docs/en/algorithms/grpo_series.md150-156
High-stability reasoning	Dr.GRPO	Uses group-level mean centering for reduced variance docs/en/algorithms/grpo_series.md133-134
Agentic tasks	PPO with critic	Value function helps with credit assignment in complex trajectories areal/api/cli_args.py1109-1111
Offline preference tuning	DPO	Stable training on static preference datasets areal/trainer/dpo_trainer.py84-116

Sources: docs/en/algorithms/grpo_series.md areal/api/cli_args.py areal/trainer/dpo_trainer.py

Core Implementation Entities

PPOTrainer

The primary orchestrator for PPO-family algorithms, managing engine creation and the training loop areal/trainer/rl_trainer.py105-200

SFTTrainer

Manages the supervised fine-tuning lifecycle areal/trainer/sft_trainer.py54-147

RWTrainer / DPOTrainer

Orchestrate preference-based training loops for reward models or direct policy optimization areal/trainer/rw_trainer.py76-180 areal/trainer/dpo_trainer.py84-147

TrainController

Handles the dispatch of tensors and data splitting across data-parallel groups during training areal/infra/controller/train_controller.py174-205

Sources: areal/trainer/rl_trainer.py areal/trainer/sft_trainer.py areal/trainer/rw_trainer.py areal/trainer/dpo_trainer.py areal/infra/controller/train_controller.py


URL: https://deepwiki.com/inclusionAI/AReaL/7.1-algorithm-overview