Last indexed: 7 May 2026 (2e12c1)

GRPO and Variants

Purpose and Scope

This page documents the Group Relative Policy Optimization (GRPO) algorithm family and its variants implemented in AReaL. These algorithms provide policy gradient-based reinforcement learning methods for training large language models, specifically optimized for reasoning tasks by using group-based advantage estimation instead of a separate critic network.

Algorithms Covered:

GRPO: Group Relative Policy Optimization (Standard) docs/en/algorithms/grpo_series.md13
GSPO: Group Sequence Policy Optimization (Qwen3) docs/en/algorithms/grpo_series.md18-20
DAPO: Decoupled Clip and Dynamic Sampling Policy Optimization docs/en/algorithms/grpo_series.md17
Dr.GRPO: Improved GRPO variant docs/en/algorithms/grpo_series.md14
RLOO: REINFORCE Leave-One-Out docs/en/algorithms/grpo_series.md15
SAPO: Soft Adaptive Policy Optimization docs/en/algorithms/grpo_series.md18
LitePPO: Efficient PPO variant using group normalization docs/en/algorithms/grpo_series.md14

Sources: docs/en/algorithms/grpo_series.md9-24 examples/math/gsm8k_grpo.yaml1-2

Algorithm Overview

Core Concept

GRPO-family algorithms optimize language model policies by comparing rewards across groups of sampled responses for the same prompt. Traditional PPO requires a separate value network (critic) to estimate advantages ($A = R - V$). In contrast, GRPO estimates advantages by computing the mean and standard deviation of rewards within a group of $G$ samples ($G \ge 2$).

Key Characteristics:

Group-based advantage estimation: Advantages are computed relative to group statistics: $\hat{A}_{i,t} = \frac{r_i - \text{mean}(R)}{\text{std}(R)}$ docs/en/algorithms/grpo_series.md162-163
No critic network required: Significantly reduces VRAM usage and computational overhead by removing the need to host and train a separate value model docs/en/algorithms/grpo_series.md150-156
Sequence-level Importance Sampling (GSPO): Computes the geometric mean of per-token ratios for sequence-level optimization docs/en/algorithms/grpo_series.md136-137
Asymmetric Clipping (DAPO): Uses different upper and lower bounds for ratio clipping docs/en/algorithms/grpo_series.md126-128
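
The group-relative advantage formula above can be sketched in a few lines of NumPy. This is a minimal illustration rather than AReaL's implementation: the function name `group_relative_advantages`, the flat reward layout (all samples for one prompt stored contiguously), and the `eps` stabilizer are assumptions made for the example.

```python
import numpy as np


def group_relative_advantages(rewards, group_size, eps=1e-6):
    """Normalize scalar rewards within each group of `group_size` samples.

    Hypothetical helper: `rewards` is a flat sequence in which the G samples
    drawn for one prompt are adjacent to each other.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    if group_size < 2 or rewards.size % group_size != 0:
        raise ValueError("need G >= 2 and len(rewards) divisible by G")

    grouped = rewards.reshape(-1, group_size)      # (num_prompts, G)
    mean = grouped.mean(axis=1, keepdims=True)     # mean(R) per prompt
    std = grouped.std(axis=1, keepdims=True)       # std(R) per prompt
    # eps (an assumption) avoids division by zero when all rewards are equal.
    return ((grouped - mean) / (std + eps)).reshape(-1)


# Two prompts, four sampled responses each, binary correctness rewards.
advantages = group_relative_advantages([1, 0, 0, 1, 1, 1, 1, 0], group_size=4)
```

Every token of response $i$ shares the same scalar advantage, and a group whose rewards are all identical yields zero advantages, so it contributes no gradient signal.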

Sources: docs/en/algorithms/grpo_series.md150-164 docs/en/algorithms/grpo_series.md130-137 examples/math/gsm8k_dapo_dynamic_bs.yaml64-65

Algorithm Comparison

Sources: docs/en/algorithms/grpo_series.md143-156 docs/en/algorithms/grpo_series.md88-92 examples/math/gsm8k_gspo.yaml74-83

Configuration Structure

Algorithm Configuration Matrix

AReaL allows switching between algorithms by adjusting NormConfig and PPOActorConfig parameters in the YAML configuration.

Algorithm	`adv_norm.mean_level`	`adv_norm.std_level`	`mean_leave1out`	`importance_sampling_level`	Special Config
PPO	`batch`	`batch`	`false`	`token`	Needs `critic:` section docs/en/algorithms/grpo_series.md145-146
GRPO	`batch`	`batch`	`false`	`token`	Standard DeepSeekMath docs/en/algorithms/grpo_series.md143-145
Dr.GRPO	`group`	`null`	`false`	`token`	Improved stability docs/en/algorithms/grpo_series.md133
RLOO	`group`	`null`	`true`	`token`	Leave-one-out baseline docs/en/algorithms/grpo_series.md165-167
GSPO	`batch`	`batch`	`false`	`sequence`	`importance_sampling_level: sequence` examples/math/gsm8k_gspo.yaml83
DAPO	`batch`	`batch`	`false`	`token`	`eps_clip_higher` set examples/math/gsm8k_dapo_dynamic_bs.yaml65
SAPO	`batch`	`batch`	`false`	`token`	`use_sapo_loss: true` docs/en/algorithms/grpo_series.md62-65
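
To make the matrix easier to scan, the sketch below transcribes a few rows into a Python dictionary and reports which keys differ between two algorithms. The key names are copied from the column headers above; the dictionary `PRESETS` and the helper `config_diff` are hypothetical, and where each key sits inside the real YAML file is not shown here.

```python
# Rows transcribed from the configuration matrix above (illustrative only).
_GRPO = {
    "adv_norm.mean_level": "batch",
    "adv_norm.std_level": "batch",
    "mean_leave1out": False,
    "importance_sampling_level": "token",
}

PRESETS = {
    "GRPO": dict(_GRPO),
    "Dr.GRPO": {**_GRPO, "adv_norm.mean_level": "group",
                "adv_norm.std_level": None},
    "RLOO": {**_GRPO, "adv_norm.mean_level": "group",
             "adv_norm.std_level": None, "mean_leave1out": True},
    "GSPO": {**_GRPO, "importance_sampling_level": "sequence"},
    "SAPO": {**_GRPO, "use_sapo_loss": True},
    # DAPO additionally sets `eps_clip_higher`; its value is not given in the
    # table, so it is deliberately left out of this sketch.
}


def config_diff(base, target):
    """Return {key: (base_value, target_value)} for keys that differ."""
    b, t = PRESETS[base], PRESETS[target]
    return {k: (b.get(k), v) for k, v in t.items() if b.get(k) != v}


gspo_changes = config_diff("GRPO", "GSPO")
rloo_changes = config_diff("GRPO", "RLOO")
```

Read this way, moving from GRPO to GSPO touches a single key, while RLOO changes both normalization levels and enables the leave-one-out baseline.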

Sources: docs/en/algorithms/grpo_series.md138-151 examples/math/gsm8k_gspo.yaml74-83 examples/math/gsm8k_dapo_dynamic_bs.yaml64-86

Implementation Architecture

Vision Support in Group Optimization

For Vision-Language Models (VLMs), AReaL uses VisionSPShard to distribute vision encoder work across ranks while maintaining the full sequence for group-based policy optimization.

Sources: areal/models/transformers/vision_sp_shard.py56-60 areal/models/transformers/vision_sp_shard.py100-105 areal/models/transformers/vision_sp_shard.py145-146

Natural Language Space to Code Entity Space

This diagram maps theoretical RL concepts to specific code entities and configuration keys in the AReaL repository.

Sources: docs/en/algorithms/grpo_series.md77-87 examples/math/gsm8k_dapo_dynamic_bs.yaml1-10 examples/math/gsm8k_gspo.yaml81-83

Normalization Configuration Details

The normalization logic is controlled via the NormConfig structure, allowing fine-grained control over how rewards are transformed into advantages for the policy gradient.

Parameter	Type	Options	Description
`mean_level`	`str`	`"batch"`, `"group"`, `None`	Level to compute mean for centering docs/en/algorithms/grpo_series.md81
`std_level`	`str`	`"batch"`, `"group"`, `None`	Level to compute std for scaling docs/en/algorithms/grpo_series.md82
`mean_leave1out`	`bool`	`true`, `false`	If true, excludes current sample from mean (RLOO-style) docs/en/algorithms/grpo_series.md83
`group_size`	`int`	`int`	Number of samples per prompt for group-level stats docs/en/algorithms/grpo_series.md86
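
The interaction of these four parameters can be illustrated with a small NumPy function. This is a sketch under stated assumptions, not the NormConfig code path: the name `normalize_rewards`, the `eps` term, and the choice to compute the standard deviation on the raw (uncentered) values are all assumptions.

```python
import numpy as np

_LEVELS = (None, "batch", "group")


def _view(x, level, group_size):
    # "batch" treats the whole batch as one row; "group" uses rows of G samples.
    return x.reshape(1, -1) if level == "batch" else x.reshape(-1, group_size)


def normalize_rewards(values, mean_level="batch", std_level="batch",
                      mean_leave1out=False, group_size=1, eps=1e-5):
    """Hypothetical re-implementation of the options described in the table."""
    if mean_level not in _LEVELS or std_level not in _LEVELS:
        raise ValueError("levels must be 'batch', 'group', or None")
    x = np.asarray(values, dtype=np.float64).reshape(-1)
    out = x.copy()

    if mean_level is not None:
        v = _view(x, mean_level, group_size)
        if mean_leave1out:
            if v.shape[1] < 2:
                raise ValueError("leave-one-out needs at least 2 samples")
            # Baseline for each sample is the mean of the *other* samples.
            mean = (v.sum(axis=1, keepdims=True) - v) / (v.shape[1] - 1)
        else:
            mean = v.mean(axis=1, keepdims=True)
        out = (v - mean).reshape(-1)

    if std_level is not None:
        v = _view(x, std_level, group_size)
        std = v.std(axis=1, keepdims=True)  # assumption: std of raw values
        out = (out.reshape(v.shape) / (std + eps)).reshape(-1)

    return out


rewards = [1, 0, 0, 1, 1, 1, 1, 0]
rloo_style = normalize_rewards(rewards, "group", None, True, group_size=4)
dr_grpo_style = normalize_rewards(rewards, "group", None, False, group_size=4)
```

Setting `std_level` to `None` skips the scaling step entirely, which is how the Dr.GRPO and RLOO rows of the configuration matrix differ from the batch-normalized rows.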

Sources: docs/en/algorithms/grpo_series.md75-97 examples/math/gsm8k_grpo.yaml74-77

Advanced Algorithm Features

Dynamic Batch Sizing

Used in DAPO variants, dynamic batch sizing adjusts the effective training load based on sequence lengths or resource availability to maintain high throughput even with variable response lengths examples/math/gsm8k_dapo_dynamic_bs.yaml5

Sequence-level Importance Sampling (GSPO)

Unlike standard GRPO, which computes importance ratios per token, GSPO (Group Sequence Policy Optimization) computes a single ratio for the entire sequence, defined as the geometric mean of the per-token ratios, to stabilize training in high-variance reasoning tasks docs/en/algorithms/grpo_series.md133-137 examples/math/gsm8k_gspo.yaml83
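
The geometric mean of token ratios is most easily computed in log space, as the exponential of the mean per-token log-ratio. The sketch below assumes padded `(batch, seq_len)` arrays of token log-probabilities and a 0/1 response mask; the function name and array layout are illustrative assumptions, not AReaL's API.

```python
import numpy as np


def sequence_importance_ratio(logp_new, logp_old, mask):
    """Geometric mean of per-token ratios pi_new / pi_old for each sequence.

    All inputs have shape (batch, seq_len); `mask` is 1 for response tokens
    and 0 for padding. Hypothetical helper for illustration.
    """
    logp_new = np.asarray(logp_new, dtype=np.float64)
    logp_old = np.asarray(logp_old, dtype=np.float64)
    mask = np.asarray(mask, dtype=np.float64)

    log_ratio = (logp_new - logp_old) * mask          # zero out padding
    lengths = np.maximum(mask.sum(axis=-1), 1.0)      # guard empty rows
    # exp(mean of log-ratios) == (product of ratios) ** (1 / length)
    return np.exp(log_ratio.sum(axis=-1) / lengths)


new = np.log([[0.5, 0.5, 0.9], [0.8, 0.4, 0.2]])
old = np.log([[0.25, 1.0, 0.1], [0.2, 0.4, 0.1]])
mask = [[1, 1, 0], [1, 1, 1]]
ratios = sequence_importance_ratio(new, old, mask)
```

Because one ratio is shared by every token in a response, a single outlier token cannot dominate the clipping decision for that response.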

Asymmetric Clipping

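Configured via eps_clip (lower bound) and eps_clip_higher (upper bound), so the importance ratio is clipped to the interval $[1 - \text{eps\_clip},\ 1 + \text{eps\_clip\_higher}]$. A larger upper bound lets the policy raise the probability of positive-advantage tokens more aggressively, while the lower bound still limits how far the probability of negative-advantage tokens can be pushed down docs/en/algorithms/grpo_series.md117-128 examples/math/gsm8k_dapo_dynamic_bs.yaml64-65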
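
A minimal sketch of the clipped surrogate with separate bounds follows. The parameter names mirror the configuration keys above, but the function itself and the fallback to symmetric clipping when `eps_clip_higher` is omitted are assumptions for illustration.

```python
import numpy as np


def clipped_surrogate(ratio, advantage, eps_clip=0.2, eps_clip_higher=None):
    """Per-token PPO-style objective (to be maximized) with asymmetric bounds.

    If `eps_clip_higher` is None, the upper bound falls back to `eps_clip`
    (an assumption of this sketch), giving the usual symmetric clip.
    """
    upper = eps_clip if eps_clip_higher is None else eps_clip_higher
    ratio = np.asarray(ratio, dtype=np.float64)
    advantage = np.asarray(advantage, dtype=np.float64)

    clipped = np.clip(ratio, 1.0 - eps_clip, 1.0 + upper)
    # The pessimistic minimum removes the incentive to move past the bounds.
    return np.minimum(ratio * advantage, clipped * advantage)


symmetric = clipped_surrogate(1.25, 1.0, eps_clip=0.2)
asymmetric = clipped_surrogate(1.25, 1.0, eps_clip=0.2, eps_clip_higher=0.28)
```

With a ratio of 1.25 and a positive advantage, the symmetric version is already clipped at 1.2, whereas the wider upper bound of 1.28 leaves the ratio unclipped, so the token still receives a gradient.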

Vision SP Sharding

For large vision models, AReaL supports distributing ViT computation across Ulysses SP ranks. Whole images are assigned to ranks via greedy contiguous bin-packing areal/models/transformers/vision_sp_shard.py56-60. The embeddings are then gathered via a custom autograd function, GatherVisionEmbeddings, which performs an all-gather in the forward pass and an all-reduce in the backward pass areal/models/transformers/vision_sp_shard.py145-150
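
One way to picture greedy contiguous bin-packing is the sketch below, which walks the images in order and hands each rank a contiguous run until its share of the total load is reached. This is not taken from vision_sp_shard.py: the function name, the use of per-image patch counts as the load measure, and the exact cut-off rule are all assumptions.

```python
def assign_images_contiguous(patch_counts, num_ranks):
    """Split images into `num_ranks` contiguous [start, end) index ranges.

    Hypothetical sketch: images are never split, order is preserved, and a
    rank keeps taking images while the running total is below its cumulative
    target. Some ranks may receive an empty range.
    """
    if num_ranks < 1:
        raise ValueError("num_ranks must be >= 1")
    total = sum(patch_counts)
    ranges, start, consumed = [], 0, 0

    for rank in range(num_ranks):
        end = start
        if rank == num_ranks - 1:
            end = len(patch_counts)          # last rank takes the remainder
        else:
            target = total * (rank + 1) / num_ranks
            while end < len(patch_counts) and consumed < target:
                consumed += patch_counts[end]
                end += 1
        ranges.append((start, end))
        start = end
    return ranges


balanced = assign_images_contiguous([4, 4, 4, 4], num_ranks=2)
skewed = assign_images_contiguous([10, 1, 1, 1, 1, 1, 1], num_ranks=2)
```

Keeping the ranges contiguous means each rank's embeddings can be concatenated in rank order after the gather to recover the original image order, at the cost of weaker balance than an unconstrained packing would achieve.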

Sources: docs/en/algorithms/grpo_series.md117-137 examples/math/gsm8k_gspo.yaml81-83 examples/math/gsm8k_dapo_dynamic_bs.yaml64-75 areal/models/transformers/vision_sp_shard.py1-15

URL: https://deepwiki.com/inclusionAI/AReaL/7.3-grpo-and-variants