
URL: https://deepwiki.com/inclusionAI/AReaL/14.1-gsm8k-math-reasoning

Last indexed: 7 May 2026 (2e12c1)

GSM8K Math Reasoning

This page provides a tutorial for training language models on the GSM8K math reasoning dataset using Group Relative Policy Optimization (GRPO). It demonstrates AReaL's capabilities for reinforcement learning on reasoning tasks by walking through a complete example from dataset loading to model training.

Scope: This page covers the practical aspects of running GSM8K training experiments. For details on the GRPO algorithm configuration, see areal/api/cli_args.py12. For information about customizing datasets, see areal/dataset.py6. For agentic workflows involving math reasoning and tool use, see examples/tir/train_tir.py58.

Overview

GSM8K (Grade School Math 8K) is a dataset of grade school math word problems that require multi-step reasoning. Each problem consists of a natural language question and a numerical answer. The dataset is used to train and evaluate models on their mathematical reasoning capabilities.
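To make the record layout concrete, here is a minimal sketch that splits a GSM8K-style answer into its reasoning text and final number. It assumes the layout of the public GSM8K release, where the answer field ends with a `#### <number>` line. The record and the `split_answer` helper are hypothetical illustrations, not part of AReaL.

```python
# Hypothetical record in the public GSM8K layout: a "question" string and an
# "answer" string whose last line is "#### <final number>".
record = {
    "question": "A box holds 12 pencils. How many pencils are in 3 boxes?",
    "answer": "Each box holds 12 pencils, so 3 boxes hold 3 * 12 = 36 pencils.\n#### 36",
}


def split_answer(answer: str) -> tuple[str, str]:
    """Split a GSM8K-style answer into (reasoning, final_answer)."""
    reasoning, sep, final = answer.rpartition("####")
    if not sep:
        raise ValueError("answer has no '####' marker")
    # Drop thousands separators so "1,000" compares equal to "1000".
    return reasoning.strip(), final.strip().replace(",", "")


reasoning, final_answer = split_answer(record["answer"])
print(final_answer)
```

Only the final number is needed as the ground truth for reward computation; the reasoning text is what supervised fine-tuning would typically train on.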

The GSM8K training example demonstrates:

Sources: examples/math/gsm8k_grpo.yaml35-40 examples/math/gsm8k_grpo.yaml128-136 examples/math/boba_grpo.py34-41

Quick Start

Single Node Training

Run GSM8K training on a single node with the local scheduler:


This command launches training with the default configuration on 8 GPUs, split evenly between inference (4 GPUs, SGLang) and training (4 GPUs, FSDP) as defined by the rollout.backend and actor.backend specifications examples/math/gsm8k_grpo.yaml22-43

Multi-Node Training

For distributed training across multiple nodes using Ray:


Note: Ensure paths in the YAML configuration point to shared storage accessible by all nodes examples/math/gsm8k_grpo.yaml12-15

Sources: examples/math/gsm8k_grpo.yaml1-21 examples/math/boba_grpo.py62-91

System Architecture for GSM8K Training

The following diagram maps the logical components of the GSM8K reasoning task to the specific code entities in the AReaL framework.

GSM8K Reasoning Component Mapping


Sources: examples/math/boba_grpo.py62-87 examples/math/gsm8k_grpo.yaml21-103 examples/tir/train_tir.py15-24

Training Pipeline Flow

The GRPO pipeline relies on group-based sampling: for each prompt, n_samples solutions are generated, and each solution's advantage is computed relative to the other solutions in its group.

GSM8K GRPO Sequence Flow


Sources: examples/math/boba_grpo.py44-60 examples/countdown/train.py33-67 examples/math/gsm8k_grpo.yaml35-40

Configuration Structure

The GSM8K GRPO configuration file follows the standard AReaL configuration format with specific settings for math reasoning tasks.

Key Configuration Sections

| Section | Purpose | Key Parameters |
| --- | --- | --- |
| rollout.backend | Inference allocation | sglang:d4p1t1 (4 GPUs for inference) examples/math/gsm8k_grpo.yaml22 |
| actor.backend | Training allocation | fsdp:d4p1t1 (4 GPUs for training) examples/math/gsm8k_grpo.yaml43 |
| gconfig | Generation parameters | n_samples: 4, max_new_tokens: 1024 examples/math/gsm8k_grpo.yaml36-38 |
| actor | Policy model config | path: Qwen/Qwen2.5-1.5B-Instruct, lr: 1.70e-5 examples/math/gsm8k_grpo.yaml46-56 |
| reward_norm | GRPO normalization | mean_level: group, std_level: group examples/math/gsm8k_grpo.yaml76-77 |

Group Sampling Configuration


The n_samples parameter is central to GRPO: it sets the group size used for the relative advantage calculation examples/math/gsm8k_grpo.yaml36-40
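The group-relative advantage that n_samples feeds into can be sketched as follows. This is an illustrative approximation of group-level mean and standard-deviation normalization, not AReaL's implementation. The function name `group_advantages`, the `eps` term, and the use of the population standard deviation are all assumptions.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize the rewards of one prompt's samples against their own group.

    Each advantage is (reward - group mean) / (group std + eps). The eps term
    is an assumed safeguard against division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt with n_samples = 4: two correct answers (1.0), two wrong (0.0).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_advantages(rewards)
print(advantages)
```

Correct samples receive positive advantages and incorrect ones negative. If every sample in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal, which is one reason the group size matters.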

Reward Computation

Math Answer Verification

The system uses a math_verify_worker to parse the mathematical expressions in model outputs and compare them to the ground-truth answers.


Sources: examples/math/boba_grpo.py44-60 examples/tir/train_tir.py15-24 examples/math/boba_grpo.py48
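As a rough illustration of answer verification, the sketch below scores a completion by comparing its last number to the ground truth. It is a deliberately simplified stand-in, not the math_verify_worker itself, which handles far more general mathematical expressions. The names `extract_last_number` and `gsm8k_reward`, and the binary 0/1 reward, are assumptions for this example.

```python
import re
from typing import Optional

# Matches integers and decimals, optionally negative, with thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_last_number(text: str) -> Optional[float]:
    """Return the last number appearing in text, or None if there is none."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def gsm8k_reward(completion: str, ground_truth: str) -> float:
    """Return 1.0 if the completion's last number equals the ground truth, else 0.0."""
    predicted = extract_last_number(completion)
    expected = extract_last_number(ground_truth)
    if predicted is None or expected is None:
        return 0.0
    return 1.0 if abs(predicted - expected) < 1e-6 else 0.0


print(gsm8k_reward("3 * 12 = 36, so the answer is 36.", "#### 36"))
```

Taking the last number is a common heuristic for GSM8K-style outputs, but it will misjudge completions that keep writing after the final answer, which is part of why a dedicated verifier is preferable.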

Training Variations

Supervised Fine-Tuning (SFT)

Before RL training, models are often fine-tuned on GSM8K using the SFTTrainer and SFTConfig. This provides a stable starting policy for the reasoning task examples/math/gsm8k_sft.py25-28 examples/math/gsm8k_sft.yaml1-60

Distilled Model Training (LRM)

AReaL supports training distilled models (e.g., Qwen2.5-1.5B-Instruct) to achieve high performance on math reasoning benchmarks. This often involves iterative context lengthening and specialized prompt templates examples/math/gsm8k_grpo.yaml46

Tool-Integrated Reasoning (TIR)

GSM8K training can be extended to use external tools (like a Python interpreter or calculator) by switching to the TIRWorkflow. This involves a TIRGRPOConfig and a specialized reward function examples/tir/train_tir.py10-24 examples/tir/train_tir.py58

Multi-Turn Math Training

AReaL supports training multi-turn math agents by utilizing the ArealOpenAI client in concat mode. This allows for the construction of conversation trees where rewards can be assigned and discounted across multiple turns examples/math/boba_grpo.py69-76
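One simple way to picture discounting across turns is shown below. The page does not specify the scheme AReaL actually uses, so this is only a generic sketch: the final reward is credited to each earlier turn, scaled by a discount factor per step of distance from the last turn. The function `discounted_turn_rewards` and the `gamma` parameter are hypothetical.

```python
def discounted_turn_rewards(
    final_reward: float, num_turns: int, gamma: float
) -> list[float]:
    """Credit each turn with the final reward, discounted by distance from the end.

    Turn t (0-indexed) receives final_reward * gamma ** (num_turns - 1 - t),
    so the last turn gets the full reward and earlier turns get less.
    """
    return [final_reward * gamma ** (num_turns - 1 - t) for t in range(num_turns)]


# A three-turn conversation whose final answer was correct (reward 1.0).
turn_rewards = discounted_turn_rewards(final_reward=1.0, num_turns=3, gamma=0.5)
print(turn_rewards)
```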

Allocation and Parallelism

The backend strings define the physical layout of the engines:

The weight_update_mode is set to xccl to enable high-speed weight synchronization between training and inference engines via the XCCL protocol examples/math/gsm8k_grpo.yaml82

Sources: examples/math/gsm8k_grpo.yaml22-82 examples/vlm/clevr_count_70k_grpo.yaml22-43
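A backend string such as sglang:d4p1t1 can be unpacked with a small parser like the sketch below. Reading d, p, and t as data-, pipeline-, and tensor-parallel degrees is an assumption, though it agrees with the 4 GPUs listed for each engine in the configuration table. The `Allocation` class and `parse_backend` function are hypothetical helpers, not AReaL APIs.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Allocation:
    backend: str
    data_parallel: int
    pipeline_parallel: int
    tensor_parallel: int

    @property
    def world_size(self) -> int:
        """Total GPUs, assuming the three parallel degrees multiply."""
        return self.data_parallel * self.pipeline_parallel * self.tensor_parallel


_PATTERN = re.compile(r"^(?P<backend>\w+):d(?P<d>\d+)p(?P<p>\d+)t(?P<t>\d+)$")


def parse_backend(spec: str) -> Allocation:
    """Parse a string like 'sglang:d4p1t1' into its backend and parallel degrees."""
    match = _PATTERN.match(spec)
    if match is None:
        raise ValueError(f"unrecognized backend spec: {spec!r}")
    return Allocation(
        match["backend"], int(match["d"]), int(match["p"]), int(match["t"])
    )


rollout = parse_backend("sglang:d4p1t1")
actor = parse_backend("fsdp:d4p1t1")
print(rollout.world_size + actor.world_size)
```

Under this reading, the two default strings together account for the 8 GPUs of a single node.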

Monitoring

Training progress is tracked through stats_tracker. Key metrics include:

Sources: examples/countdown/train.py50-67 examples/math/gsm8k_grpo.yaml64-81