Last indexed: 7 May 2026 (2e12c1)

GSM8K Math Reasoning

This page provides a tutorial for training language models on the GSM8K math reasoning dataset using Group Relative Policy Optimization (GRPO). It demonstrates AReaL's capabilities for reinforcement learning on reasoning tasks by walking through a complete example from dataset loading to model training.

Scope: This page covers the practical aspects of running GSM8K training experiments. For details on the GRPO algorithm configuration, see areal/api/cli_args.py12. For information about customizing datasets, see areal/dataset.py6. For agentic workflows involving math reasoning and tool use, see examples/tir/train_tir.py58.

Overview

GSM8K (Grade School Math 8K) is a dataset of grade school math word problems that require multi-step reasoning. Each problem consists of a natural language question and a numerical answer. The dataset is used to train and evaluate models on their mathematical reasoning capabilities.
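As an illustration of the question/answer structure described above, the sketch below pulls the final numerical answer out of a GSM8K-style solution string. It relies on the public dataset's convention that each solution ends with a `#### <number>` line. The sample record and the helper name `extract_reference_answer` are illustrative and are not taken from the AReaL code base.

```python
import re
from typing import Optional

# A record shaped like the public GSM8K data: a question plus a worked
# solution whose last line is "#### <final answer>".
SAMPLE = {
    "question": (
        "Natalia sold clips to 48 of her friends in April, and then she sold "
        "half as many clips in May. How many clips did Natalia sell "
        "altogether in April and May?"
    ),
    "answer": (
        "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n"
        "Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n"
        "#### 72"
    ),
}

# Number after the '####' marker; allows a sign, thousands separators, decimals.
_FINAL_ANSWER = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def extract_reference_answer(solution: str) -> Optional[str]:
    """Return the number after '####' with thousands separators removed."""
    match = _FINAL_ANSWER.search(solution)
    if match is None:
        return None
    return match.group(1).replace(",", "")


reference = extract_reference_answer(SAMPLE["answer"])
```

The extracted string is what a correctness-based reward would compare a model's final answer against.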

The GSM8K training example demonstrates:

Loading and processing the GSM8K dataset via HuggingFace examples/math/gsm8k_grpo.yaml133
Generating multiple solution candidates per problem (Group Sampling) using n_samples examples/math/gsm8k_grpo.yaml36
Computing rewards based on correctness of final answers using specialized math verifiers examples/math/boba_grpo.py48-53
Optimizing the policy using GRPO with group-level normalization examples/math/gsm8k_grpo.yaml75-78
Tracking training metrics and saving checkpoints examples/math/gsm8k_grpo.yaml145-180

Sources: examples/math/gsm8k_grpo.yaml35-40 examples/math/gsm8k_grpo.yaml128-136 examples/math/boba_grpo.py34-41

Quick Start

Single Node Training

Run GSM8K training on a single node with the local scheduler:

This command launches training with the default configuration on 8 GPUs: 4 for inference (SGLang) and 4 for training (FSDP), as defined by the rollout.backend and actor.backend specifications examples/math/gsm8k_grpo.yaml22-43

Multi-Node Training

For distributed training across multiple nodes using Ray:

Note: Ensure paths in the YAML configuration point to shared storage accessible by all nodes examples/math/gsm8k_grpo.yaml12-15

Sources: examples/math/gsm8k_grpo.yaml1-21 examples/math/boba_grpo.py62-91

System Architecture for GSM8K Training

The following diagram maps the logical components of the GSM8K reasoning task to the specific code entities in the AReaL framework.

[Diagram: GSM8K Reasoning Component Mapping]

Sources: examples/math/boba_grpo.py62-87 examples/math/gsm8k_grpo.yaml21-103 examples/tir/train_tir.py15-24

Training Pipeline Flow

The GRPO pipeline relies on group-based sampling: for each prompt, n_samples solutions are generated, and each solution's advantage is computed relative to the other solutions in its group.
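The sampling step can be sketched in a few lines of plain Python. This is a toy illustration only: `rollout_groups`, `toy_sampler`, and `toy_reward` are hypothetical names standing in for the inference engine and the reward function, and none of them are AReaL APIs.

```python
import random
from typing import Callable, Dict, List


def rollout_groups(
    prompts: List[str],
    sample_fn: Callable[[str], str],
    reward_fn: Callable[[str, str], float],
    n_samples: int,
) -> List[Dict]:
    """Sample n_samples completions per prompt and score each one."""
    groups = []
    for prompt in prompts:
        completions = [sample_fn(prompt) for _ in range(n_samples)]
        rewards = [reward_fn(prompt, c) for c in completions]
        groups.append(
            {"prompt": prompt, "completions": completions, "rewards": rewards}
        )
    return groups


def toy_sampler(prompt: str) -> str:
    # Stand-in for the inference engine: guesses "4" or "5" at random.
    return random.choice(["4", "5"])


def toy_reward(prompt: str, completion: str) -> float:
    # Stand-in for the math verifier: the correct answer is "4".
    return 1.0 if completion == "4" else 0.0


random.seed(0)
groups = rollout_groups(["What is 2 + 2?"], toy_sampler, toy_reward, n_samples=4)
```

Each entry in `groups` holds the rewards for one prompt, which is the unit over which relative advantages are computed.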

[Diagram: GSM8K GRPO Sequence Flow]

Sources: examples/math/boba_grpo.py44-60 examples/countdown/train.py33-67 examples/math/gsm8k_grpo.yaml35-40

Configuration Structure

The GSM8K GRPO configuration file follows the standard AReaL configuration format with specific settings for math reasoning tasks.

Key Configuration Sections

| Section | Purpose | Key Parameters | Source |
| --- | --- | --- | --- |
| `rollout.backend` | Inference allocation | `sglang:d4p1t1` (4 GPUs for inference) | examples/math/gsm8k_grpo.yaml22 |
| `actor.backend` | Training allocation | `fsdp:d4p1t1` (4 GPUs for training) | examples/math/gsm8k_grpo.yaml43 |
| `gconfig` | Generation parameters | `n_samples: 4`, `max_new_tokens: 1024` | examples/math/gsm8k_grpo.yaml36-38 |
| `actor` | Policy model config | `path: Qwen/Qwen2.5-1.5B-Instruct`, `lr: 1.70e-5` | examples/math/gsm8k_grpo.yaml46-56 |
| `reward_norm` | GRPO normalization | `mean_level: group`, `std_level: group` | examples/math/gsm8k_grpo.yaml76-77 |

Group Sampling Configuration

The n_samples parameter sets the GRPO group size: each prompt is sampled n_samples times, and every response's advantage is computed relative to the other responses to the same prompt examples/math/gsm8k_grpo.yaml36-40
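A minimal sketch of group-level normalization follows, assuming the common GRPO formulation in which each reward is centered on the group mean and divided by the group standard deviation. The function name, the use of the population standard deviation, and the `eps` term are assumptions made for illustration. AReaL's own normalization may differ in these details.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize the rewards of one prompt's group to zero mean and unit scale."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    # eps keeps the division finite when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of n_samples = 4 responses: two correct, two incorrect.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every response is correct carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

With two correct and two incorrect responses, the correct ones receive advantages close to +1 and the incorrect ones close to -1. When all rewards in a group are equal, every advantage is zero, which is why a group size larger than one is needed for the relative signal to exist.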

Reward Computation

Math Answer Verification

The system uses a math_verify_worker to parse the mathematical expression in each model response and compare it with the ground-truth answer.
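To make the idea concrete, here is a deliberately simplified correctness reward. It is not the math_verify_worker implementation: it only takes the last number that appears in the completion and compares it exactly with the reference, whereas a real verifier handles far more expression formats. The names `last_number` and `correctness_reward` are hypothetical.

```python
import re
from fractions import Fraction
from typing import Optional

# Integers, decimals, or simple fractions, optionally with thousands separators.
# The lookbehind stops "48-24" from being read as the negative number -24.
_NUMBER = re.compile(r"(?<!\d)-?\d[\d,]*(?:\.\d+)?(?:/\d+)?")


def last_number(text: str) -> Optional[Fraction]:
    """Return the last number found in text as an exact Fraction, if any."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    token = matches[-1].replace(",", "")
    try:
        return Fraction(token)
    except (ValueError, ZeroDivisionError):
        return None


def correctness_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the completion's final number equals the reference, else 0.0."""
    predicted = last_number(completion)
    expected = last_number(reference)
    if predicted is None or expected is None:
        return 0.0
    return 1.0 if predicted == expected else 0.0
```

Comparing exact `Fraction` values rather than strings means that `72`, `72.0`, and `1,250` versus `1250` are treated as equal without floating-point tolerance choices.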

Sources: examples/math/boba_grpo.py44-60 examples/tir/train_tir.py15-24 examples/math/boba_grpo.py48

Training Variations

Supervised Fine-Tuning (SFT)

Before RL training, models are often fine-tuned on GSM8K using the SFTTrainer and SFTConfig. This provides a stable starting policy for the reasoning task examples/math/gsm8k_sft.py25-28 examples/math/gsm8k_sft.yaml1-60

Distilled Model Training (LRM)

AReaL supports training distilled models (e.g., Qwen2.5-1.5B-Instruct) to achieve high performance on math reasoning benchmarks. This often involves iterative context lengthening and specialized prompt templates examples/math/gsm8k_grpo.yaml46

Tool-Integrated Reasoning (TIR)

GSM8K training can be extended to use external tools (like a Python interpreter or calculator) by switching to the TIRWorkflow. This involves a TIRGRPOConfig and a specialized reward function examples/tir/train_tir.py10-24 examples/tir/train_tir.py58

Multi-Turn Math Training

AReaL supports training multi-turn math agents by utilizing the ArealOpenAI client in concat mode. This allows for the construction of conversation trees where rewards can be assigned and discounted across multiple turns examples/math/boba_grpo.py69-76
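The sketch below shows one way a final reward could be discounted across turns. It assumes geometric discounting from the last turn backwards. The function name and the discount value are hypothetical, and the scheme AReaL actually applies is not specified in this page.

```python
from typing import List


def discounted_turn_rewards(
    final_reward: float, num_turns: int, discount: float = 0.9
) -> List[float]:
    """Spread a final reward over turns, shrinking it for earlier turns.

    The last turn gets the full reward; each earlier turn gets the reward
    multiplied by `discount` once per turn of distance from the end.
    """
    return [
        final_reward * discount ** (num_turns - 1 - turn)
        for turn in range(num_turns)
    ]


# Three-turn conversation that ends with a correct answer.
turn_rewards = discounted_turn_rewards(1.0, num_turns=3, discount=0.5)
```

With a discount of 0.5 and three turns, the turns receive 0.25, 0.5, and 1.0 in order, so turns closer to the final answer are credited more.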

Allocation and Parallelism

The backend strings define the physical layout of the engines:

sglang:d4p1t1: The SGLang inference engine runs on 4 GPUs with data parallel size 4 (d4), pipeline parallel size 1 (p1), and tensor parallel size 1 (t1) examples/math/gsm8k_grpo.yaml22
fsdp:d4p1t1: The FSDP training engine runs on 4 GPUs with the same d4, p1, t1 layout examples/math/gsm8k_grpo.yaml43
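The backend strings can be read mechanically. The parser below is an illustrative sketch, not AReaL's own allocation parser: the `Allocation` class and `parse_allocation` function are hypothetical, and the GPU count is assumed to be the product of the three parallel sizes.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Allocation:
    backend: str
    data_parallel: int
    pipeline_parallel: int
    tensor_parallel: int

    @property
    def num_gpus(self) -> int:
        # Assumption: one GPU per (data, pipeline, tensor) rank combination.
        return self.data_parallel * self.pipeline_parallel * self.tensor_parallel


_PATTERN = re.compile(r"^(?P<backend>\w+):d(?P<d>\d+)p(?P<p>\d+)t(?P<t>\d+)$")


def parse_allocation(spec: str) -> Allocation:
    """Parse a string such as 'sglang:d4p1t1' into its components."""
    match = _PATTERN.match(spec)
    if match is None:
        raise ValueError(f"unrecognized allocation string: {spec!r}")
    return Allocation(
        backend=match["backend"],
        data_parallel=int(match["d"]),
        pipeline_parallel=int(match["p"]),
        tensor_parallel=int(match["t"]),
    )


rollout = parse_allocation("sglang:d4p1t1")
actor = parse_allocation("fsdp:d4p1t1")
total_gpus = rollout.num_gpus + actor.num_gpus
```

Under these assumptions the two default strings account for 4 GPUs each, 8 in total, which matches the single-node setup described in the Quick Start.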

The weight_update_mode is set to xccl so that updated weights are synchronized between the training and inference engines at high speed over XCCL collective communication examples/math/gsm8k_grpo.yaml82

Sources: examples/math/gsm8k_grpo.yaml22-82 examples/vlm/clevr_count_70k_grpo.yaml22-43

Monitoring

Training progress is tracked through stats_tracker. Key metrics include:

reward: The score returned by the reward function examples/countdown/train.py56
kl_divergence: Controlled by actor.kl_ctl to prevent the policy from drifting too far from the reference model examples/math/gsm8k_grpo.yaml68
loss: The GRPO policy loss computed by the TrainEngine.

Sources: examples/countdown/train.py50-67 examples/math/gsm8k_grpo.yaml64-81


URL: https://deepwiki.com/inclusionAI/AReaL/14.1-gsm8k-math-reasoning