URL: https://deepwiki.com/inclusionAI/AReaL/7.3-grpo-and-variants

Last indexed: 7 May 2026 (2e12c1)

GRPO and Variants

Purpose and Scope

This page documents the Group Relative Policy Optimization (GRPO) algorithm family and its variants implemented in AReaL. These algorithms provide policy gradient-based reinforcement learning methods for training large language models, specifically optimized for reasoning tasks by using group-based advantage estimation instead of a separate critic network.

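Algorithms Covered: PPO (critic-based baseline), GRPO, Dr.GRPO, RLOO, GSPO, DAPO, and SAPO, as listed in the Algorithm Configuration Matrix.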

Sources: docs/en/algorithms/grpo_series.md9-24 examples/math/gsm8k_grpo.yaml1-2


Algorithm Overview

Core Concept

GRPO-family algorithms optimize language model policies by comparing rewards across groups of sampled responses for the same prompt. Traditional PPO requires a separate value network (critic) to estimate advantages ($A = R - V$). In contrast, GRPO estimates advantages by computing the mean and standard deviation of rewards within a group of $G$ samples ($G \ge 2$).
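The group-based estimate described above can be sketched in a few lines of Python. This is a minimal illustration and not AReaL's implementation: the function name `group_relative_advantages`, the `eps` stabilizer, and the choice of the population standard deviation are all assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of the G responses sampled for one prompt.

    Hypothetical helper: each advantage is (r_i - group mean) / (group std + eps).
    """
    if len(rewards) < 2:
        raise ValueError("a group needs at least two samples (G >= 2)")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; the choice of estimator is an assumption
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four responses to the same prompt, two correct (reward 1) and two wrong (reward 0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct responses receive a positive advantage and incorrect ones a negative advantage of the same magnitude, without any learned value function.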

Key Characteristics:

Sources: docs/en/algorithms/grpo_series.md150-164 docs/en/algorithms/grpo_series.md130-137 examples/math/gsm8k_dapo_dynamic_bs.yaml64-65

Algorithm Comparison


Sources: docs/en/algorithms/grpo_series.md143-156 docs/en/algorithms/grpo_series.md88-92 examples/math/gsm8k_gspo.yaml74-83


Configuration Structure

Algorithm Configuration Matrix

AReaL allows switching between algorithms by adjusting NormConfig and PPOActorConfig parameters in the YAML configuration.

| Algorithm | adv_norm.mean_level | adv_norm.std_level | mean_leave1out | importance_sampling_level | Special Config |
| --- | --- | --- | --- | --- | --- |
| PPO | batch | batch | false | token | Needs critic: section docs/en/algorithms/grpo_series.md145-146 |
| GRPO | batch | batch | false | token | Standard DeepSeekMath docs/en/algorithms/grpo_series.md143-145 |
| Dr.GRPO | group | null | false | token | Improved stability docs/en/algorithms/grpo_series.md133 |
| RLOO | group | null | true | token | Leave-one-out baseline docs/en/algorithms/grpo_series.md165-167 |
| GSPO | batch | batch | false | sequence | importance_sampling_level: sequence examples/math/gsm8k_gspo.yaml83 |
| DAPO | batch | batch | false | token | eps_clip_higher set examples/math/gsm8k_dapo_dynamic_bs.yaml65 |
| SAPO | batch | batch | false | token | use_sapo_loss: true docs/en/algorithms/grpo_series.md62-65 |

Sources: docs/en/algorithms/grpo_series.md138-151 examples/math/gsm8k_gspo.yaml74-83 examples/math/gsm8k_dapo_dynamic_bs.yaml64-86
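To see which settings actually change when switching between two algorithms, the matrix can be transcribed into a small lookup. The sketch below is illustrative only: the `PRESETS` dictionary and `settings_to_change` helper are hypothetical, the keys are the column headers of the matrix rather than verified full YAML paths, and the "Special Config" column (critic section, `eps_clip_higher`, `use_sapo_loss`) is deliberately left out.

```python
# Values transcribed from the configuration matrix; None stands for YAML null.
PRESETS = {
    "PPO":     {"adv_norm.mean_level": "batch", "adv_norm.std_level": "batch",
                "mean_leave1out": False, "importance_sampling_level": "token"},
    "GRPO":    {"adv_norm.mean_level": "batch", "adv_norm.std_level": "batch",
                "mean_leave1out": False, "importance_sampling_level": "token"},
    "Dr.GRPO": {"adv_norm.mean_level": "group", "adv_norm.std_level": None,
                "mean_leave1out": False, "importance_sampling_level": "token"},
    "RLOO":    {"adv_norm.mean_level": "group", "adv_norm.std_level": None,
                "mean_leave1out": True, "importance_sampling_level": "token"},
    "GSPO":    {"adv_norm.mean_level": "batch", "adv_norm.std_level": "batch",
                "mean_leave1out": False, "importance_sampling_level": "sequence"},
}


def settings_to_change(current, target):
    """Return {key: (old, new)} for every key that differs between two presets."""
    old, new = PRESETS[current], PRESETS[target]
    return {key: (old[key], new[key]) for key in old if old[key] != new[key]}


print(settings_to_change("GRPO", "GSPO"))
# {'importance_sampling_level': ('token', 'sequence')}
```

Rows that differ only in the "Special Config" column, such as DAPO and SAPO, would show no difference in this reduced view.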


Implementation Architecture

Vision Support in Group Optimization

For Vision-Language Models (VLMs), AReaL uses VisionSPShard to distribute vision encoder work across ranks while maintaining the full sequence for group-based policy optimization.


Sources: areal/models/transformers/vision_sp_shard.py56-60 areal/models/transformers/vision_sp_shard.py100-105 areal/models/transformers/vision_sp_shard.py145-146

Natural Language Space to Code Entity Space

This diagram maps theoretical RL concepts to specific code entities and configuration keys in the AReaL repository.


Sources: docs/en/algorithms/grpo_series.md77-87 examples/math/gsm8k_dapo_dynamic_bs.yaml1-10 examples/math/gsm8k_gspo.yaml81-83


Normalization Configuration Details

The normalization logic is controlled via the NormConfig structure, allowing fine-grained control over how rewards are transformed into advantages for the policy gradient.

| Parameter | Type | Options | Description |
| --- | --- | --- | --- |
| mean_level | str | "batch", "group", None | Level to compute mean for centering docs/en/algorithms/grpo_series.md81 |
| std_level | str | "batch", "group", None | Level to compute std for scaling docs/en/algorithms/grpo_series.md82 |
| mean_leave1out | bool | true, false | If true, excludes current sample from mean (RLOO-style) docs/en/algorithms/grpo_series.md83 |
| group_size | int | int | Number of samples per prompt for group-level stats docs/en/algorithms/grpo_series.md86 |

Sources: docs/en/algorithms/grpo_series.md75-97 examples/math/gsm8k_grpo.yaml74-77
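How the four parameters interact can be illustrated with a small pure-Python sketch. It is not the AReaL code: `NormSketch` is a hypothetical stand-in for NormConfig, the `eps` field and the use of the population standard deviation are assumptions, and rewards are assumed to be laid out so that each consecutive run of `group_size` entries belongs to one prompt.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List, Optional


@dataclass
class NormSketch:
    """Hypothetical stand-in mirroring the parameters in the table above."""
    mean_level: Optional[str] = "batch"
    std_level: Optional[str] = "batch"
    mean_leave1out: bool = False
    group_size: int = 1
    eps: float = 1e-6  # assumption: small constant to avoid division by zero


def normalize(rewards: List[float], cfg: NormSketch) -> List[float]:
    n = len(rewards)
    if n % cfg.group_size != 0:
        raise ValueError("batch size must be a multiple of group_size")

    def scope(level: str, i: int) -> List[int]:
        # Indices that contribute statistics for sample i at the given level.
        if level == "batch":
            return list(range(n))
        start = (i // cfg.group_size) * cfg.group_size
        return list(range(start, start + cfg.group_size))

    out = []
    for i, r in enumerate(rewards):
        value = r
        if cfg.mean_level is not None:
            idx = scope(cfg.mean_level, i)
            if cfg.mean_leave1out:
                idx = [j for j in idx if j != i]  # RLOO-style baseline
                if not idx:
                    raise ValueError("leave-one-out needs at least two samples")
            value -= mean([rewards[j] for j in idx])
        if cfg.std_level is not None:
            idx = scope(cfg.std_level, i)
            value /= pstdev([rewards[j] for j in idx]) + cfg.eps
        out.append(value)
    return out


rewards = [1.0, 0.0, 3.0, 1.0]  # two prompts, two samples each
group_centered = normalize(rewards, NormSketch("group", None, False, group_size=2))
leave_one_out = normalize(rewards, NormSketch("group", None, True, group_size=2))
```

With `std_level=None` the rewards are only centered, so `group_centered` is `[0.5, -0.5, 1.0, -1.0]`, while the leave-one-out baseline gives `[1.0, -1.0, 2.0, -2.0]` because each sample is compared only against the other member of its group.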


Advanced Algorithm Features

Dynamic Batch Sizing

Dynamic batch sizing, used in the DAPO variants, adjusts the effective training load based on sequence lengths or resource availability so that throughput stays high even when response lengths vary examples/math/gsm8k_dapo_dynamic_bs.yaml5.

Sequence-level Importance Sampling (GSPO)

Unlike standard GRPO, which computes importance ratios per token, GSPO (Group Sequence Policy Optimization) computes a single ratio for the entire sequence, the geometric mean of the token ratios, to stabilize training in high-variance reasoning tasks docs/en/algorithms/grpo_series.md133-137 examples/math/gsm8k_gspo.yaml83.
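The difference between token-level and sequence-level ratios can be made concrete with a short sketch. The function names are hypothetical and the inputs are assumed to be per-token log-probabilities of one sampled response under the current and the behavior policy. The geometric mean of the token ratios equals the exponential of the mean log-ratio, which is the numerically safer form.

```python
import math


def token_ratios(logp_new, logp_old):
    """Per-token importance ratios pi_new / pi_old."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]


def sequence_ratio(logp_new, logp_old):
    """Single ratio for the whole sequence: geometric mean of the token ratios."""
    if len(logp_new) != len(logp_old) or not logp_new:
        raise ValueError("log-prob lists must be non-empty and of equal length")
    diffs = [n - o for n, o in zip(logp_new, logp_old)]
    return math.exp(sum(diffs) / len(diffs))


logp_new = [-1.0, -2.0]
logp_old = [-1.5, -1.5]
per_token = token_ratios(logp_new, logp_old)      # one ratio per token
per_sequence = sequence_ratio(logp_new, logp_old)  # one ratio per response
```

In this example the two token ratios deviate from 1 in opposite directions, while the sequence-level ratio is exactly 1, which shows how averaging in log space damps token-level noise.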

Asymmetric Clipping

Asymmetric clipping is configured via eps_clip (lower bound) and eps_clip_higher (upper bound). A larger upper bound lets the importance ratio rise further above 1 than it may fall below 1, so the policy can move more aggressively in one direction (usually towards higher rewards) while remaining constrained in the other docs/en/algorithms/grpo_series.md117-128 examples/math/gsm8k_dapo_dynamic_bs.yaml64-65.
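A per-token version of the clipped surrogate with separate bounds might look like the sketch below. It follows the usual PPO-style clipped objective. The function name is hypothetical, the fallback to symmetric clipping when `eps_clip_higher` is not given is an assumption, and the values 0.2 and 0.28 are illustrative rather than taken from the AReaL example configs.

```python
def clipped_objective(ratio, advantage, eps_clip, eps_clip_higher=None):
    """PPO-style clipped surrogate for one token (to be maximized).

    The ratio is clipped to [1 - eps_clip, 1 + upper], where upper is
    eps_clip_higher if provided and eps_clip otherwise (assumed fallback).
    """
    upper = eps_clip if eps_clip_higher is None else eps_clip_higher
    clipped = min(max(ratio, 1.0 - eps_clip), 1.0 + upper)
    return min(ratio * advantage, clipped * advantage)


# Positive advantage, ratio well above 1: the upper bound decides the cap.
symmetric = clipped_objective(1.5, 1.0, eps_clip=0.2)
asymmetric = clipped_objective(1.5, 1.0, eps_clip=0.2, eps_clip_higher=0.28)

# Negative advantage, ratio well below 1: only the lower bound matters.
lower_side = clipped_objective(0.5, -1.0, eps_clip=0.2, eps_clip_higher=0.28)
```

Raising only the upper bound increases the cap for positive-advantage tokens (about 1.28 instead of 1.2 here) while the lower side is unchanged.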

Vision SP Sharding

For large vision models, AReaL supports distributing ViT computation across Ulysses SP ranks. Whole images are assigned to ranks via greedy contiguous bin-packing areal/models/transformers/vision_sp_shard.py56-60. The embeddings are then gathered via a custom autograd function, GatherVisionEmbeddings, which handles all-gather in the forward pass and all-reduce in the backward pass areal/models/transformers/vision_sp_shard.py145-150.
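One way to read "greedy contiguous bin-packing" is sketched below. This is an assumption about the general technique, not a transcription of vision_sp_shard.py: the function name, the per-image cost measure (`patch_counts`), and the exact cut rule are hypothetical. Images are walked in order, and a new rank is started whenever adding the next image would push the current rank past the average load.

```python
def assign_images_contiguous(patch_counts, num_ranks):
    """Assign whole images to ranks in order; returns one rank id per image.

    Hypothetical sketch: each rank receives a contiguous run of images whose
    total cost stays near total / num_ranks where possible.
    """
    if num_ranks < 1:
        raise ValueError("num_ranks must be at least 1")
    target = sum(patch_counts) / num_ranks
    assignment = []
    rank, load = 0, 0
    for cost in patch_counts:
        # Move to the next rank if this image would overshoot the target,
        # unless the current rank is still empty or is the last rank.
        if load > 0 and load + cost > target and rank < num_ranks - 1:
            rank += 1
            load = 0
        assignment.append(rank)
        load += cost
    return assignment


balanced = assign_images_contiguous([4, 4, 4, 4], num_ranks=2)
skewed = assign_images_contiguous([10, 1, 1], num_ranks=2)
```

Here `balanced` is `[0, 0, 1, 1]` and `skewed` is `[0, 1, 1]`: a single large image occupies a rank on its own, and no image is ever split across ranks.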

Sources: docs/en/algorithms/grpo_series.md117-137 examples/math/gsm8k_gspo.yaml81-83 examples/math/gsm8k_dapo_dynamic_bs.yaml64-75 areal/models/transformers/vision_sp_shard.py1-15