Last indexed: 7 May 2026 (2e12c1)

Quick Start Guide

Purpose and Scope

This guide provides a hands-on tutorial for running your first reinforcement learning training job with AReaL. You will train a small language model on the GSM8K math reasoning task using the GRPO algorithm with function-based rewards. By the end of this guide, you will understand the basic workflow of launching training, monitoring progress, and interpreting results.

For installation instructions, see Setup and Installation. For detailed architecture explanations, see Architecture Overview.

Sources: README.md117-135 examples/math/gsm8k_grpo.yaml1-184

Prerequisites

Before starting, ensure your environment meets the hardware and software requirements.

Hardware Requirements

GPU: NVIDIA GPUs with CUDA support (e.g., H100, A100).
NPU: Support for Huawei Ascend (NPU) is officially available in the ascend branch for math and vision tasks README.md82-87
Storage: Local or shared storage (NAS/NFS) for checkpoints and logs, specified via cluster.fileroot examples/math/gsm8k_grpo.yaml12

Software Setup

Requirement	Command
Python 3.12+	`python3 --version` AGENTS.md7
uv package manager	`pip install uv` README.md127
Dependencies	`uv sync --extra cuda` AGENTS.md11
Flash Attention	`uv pip install <prebuilt_wheel_url>` README.md128-131

Your First Training Run

Minimal Working Example (Single Node)

AReaL's training scripts automatically download the required dataset (openai/gsm8k) and model (Qwen/Qwen2.5-1.5B-Instruct), so the example can be run with the default configuration in examples/math/gsm8k_grpo.yaml.

Launching the example performs the following:

Config Loading: The script parses the YAML and CLI overrides into a configuration object using load_expr_config examples/math/gsm8k_grpo.yaml1-184
Worker Management: When scheduler.type is null or local, the trainer manages resources on the local node itself examples/math/gsm8k_grpo.yaml18-19
Execution: Runs RL training using the orchestrated workflow defined in the configuration examples/math/gsm8k_grpo.yaml1-184

Sources: examples/math/gsm8k_grpo.yaml1-184 AGENTS.md58-76

Distributed Training (Multi-Node)

For distributed experiments across clusters, use the Ray or Slurm schedulers. AReaL decouples generation and training across separate resources examples/math/gsm8k_grpo.yaml9-33

The RayScheduler or SlurmScheduler manages process spawning across nodes, while the backend string (e.g., sglang:d4p1t1) specifies the parallelism mesh examples/math/gsm8k_grpo.yaml22-43

Understanding the Configuration

Configuration Hierarchy

AReaL uses a hierarchical dataclass-based configuration system. Settings can be specified in YAML or overridden via CLI using key=value examples/math/gsm8k_grpo.yaml1-184
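The override mechanism can be pictured as writing dotted keys into a nested mapping. The sketch below is illustrative only: apply_overrides is a hypothetical helper, not AReaL's load_expr_config, and the base dictionary stands in for values that would normally come from the YAML file (cluster.n_nodes is used purely as an example key).

```python
import ast
from copy import deepcopy


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Return a copy of `config` with dotted `key=value` overrides applied.

    Hypothetical helper for illustration; not part of AReaL.
    """
    result = deepcopy(config)
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        try:
            # Interpret Python-style literals such as 1, 0.5, or True.
            value = ast.literal_eval(raw_value)
        except (ValueError, SyntaxError):
            value = raw_value  # anything else is kept as a plain string
        node = result
        *parents, leaf = dotted_key.split(".")
        for part in parents:
            # Walk down the hierarchy, creating missing levels on the way.
            node = node.setdefault(part, {})
        node[leaf] = value
    return result


# Stand-in for values loaded from the YAML file (placeholder contents).
base = {
    "cluster": {"fileroot": "/tmp/areal/experiments"},
    "rollout": {"backend": "sglang:d4p1t1"},
}

merged = apply_overrides(base, ["rollout.backend=sglang:d2p1t1", "cluster.n_nodes=1"])
print(merged["rollout"]["backend"])
print(merged["cluster"])
```

The original mapping is left untouched, so the same base configuration can be reused for several trials with different overrides.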

Resource Allocation Syntax

The backend fields (e.g., actor.backend, rollout.backend) define resource allocation using the pattern: backend_name:d<DP>p<PP>t<TP> examples/math/gsm8k_grpo.yaml22-43

DP (Data Parallel): Number of model replicas.
PP (Pipeline Parallel): Number of pipeline stages.
TP (Tensor Parallel): Number of tensor shards.

Example: rollout.backend=sglang:d4p1t1 allocates 4 GPUs to SGLang inference: 4 data-parallel replicas, each with PP=1 and TP=1 (4 × 1 × 1 = 4) examples/math/gsm8k_grpo.yaml22
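To make the pattern concrete, here is a minimal sketch that parses a backend string and derives the GPU count. BackendSpec and parse_backend are hypothetical names, not AReaL APIs. The sketch handles only the three-dimension form shown above and assumes the GPU count is the product DP × PP × TP.

```python
import re
from dataclasses import dataclass

# Matches strings of the form backend_name:d<DP>p<PP>t<TP>, e.g. sglang:d4p1t1.
_BACKEND_RE = re.compile(r"^(?P<name>\w+):d(?P<dp>\d+)p(?P<pp>\d+)t(?P<tp>\d+)$")


@dataclass(frozen=True)
class BackendSpec:
    """Hypothetical container for a parsed backend string."""

    name: str
    dp: int  # data-parallel replicas
    pp: int  # pipeline stages
    tp: int  # tensor shards

    @property
    def n_gpus(self) -> int:
        # Assumption: one GPU per (replica, stage, shard) combination.
        return self.dp * self.pp * self.tp


def parse_backend(spec: str) -> BackendSpec:
    match = _BACKEND_RE.match(spec)
    if match is None:
        raise ValueError(f"unrecognized backend string: {spec!r}")
    return BackendSpec(
        name=match["name"],
        dp=int(match["dp"]),
        pp=int(match["pp"]),
        tp=int(match["tp"]),
    )


rollout = parse_backend("sglang:d4p1t1")
print(rollout.name, rollout.n_gpus)  # sglang 4
```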

Sources: examples/math/gsm8k_grpo.yaml1-127 AGENTS.md84-90

Training Execution Flow

Natural Language to Code Entity Mapping

The logical steps of an RL experiment map onto specific classes and interfaces in the AReaL codebase; the component table in the next section lists the main ones.

Sources: examples/math/gsm8k_grpo.yaml1-184 AGENTS.md58-76 CLAUDE.md12-24

Implementation Detail: The Training Components

The system relies on modular components to orchestrate high-throughput RL.

Component	Code Entity	Role
Trainer	`PPOTrainer`	Orchestrates the async rollout and training loop CLAUDE.md12-24
Workflow	`RLVRWorkflow`	Manages episode generation and reward calculation CLAUDE.md20
Inference Backend	`SGLangBackend`	High-throughput generation via Radix Attention blog/AReaL_v0_2.md60-67
Training Engine	`FSDPEngine`	Sharded data parallelism for model training examples/math/gsm8k_grpo.yaml43
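The reward-calculation step can be sketched with a function-based reward and the group-relative advantage that GRPO is commonly described as using (each reward normalized by the mean and standard deviation of its sampling group). This is an illustration under assumptions, not AReaL's implementation: gsm8k_reward and group_relative_advantages are hypothetical names, the answer-extraction rule (last number in the completion) is a simplification, and the use of the population standard deviation is a choice made for this sketch.

```python
import re
import statistics


def gsm8k_reward(completion: str, reference_answer: str) -> float:
    """Return 1.0 if the last number in `completion` equals the reference, else 0.0."""
    # Drop thousands separators, then collect every integer or decimal number.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    if not numbers:
        return 0.0
    return 1.0 if float(numbers[-1]) == float(reference_answer) else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of samples drawn from the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled responses to one prompt whose reference answer is 7.
completions = [
    "She has 3 + 4 = 7 apples. The answer is 7.",
    "3 * 4 = 12, so the answer is 12.",
    "The answer is 7",
    "I am not sure.",
]
rewards = [gsm8k_reward(c, "7") for c in completions]
advantages = group_relative_advantages(rewards)
print(rewards)  # [1.0, 0.0, 1.0, 0.0]
print(advantages)
```

Correct responses receive positive advantages and incorrect ones negative advantages; when all responses in a group score the same, every advantage is zero and the group contributes no learning signal.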

Sources: CLAUDE.md12-24 blog/AReaL_v0_2.md58-67

Performance Tuning

SGLang and Radix Attention

In v0.2, AReaL upgraded to SGLang v0.4.0, leveraging its radix attention mechanism. This significantly improves throughput when sampling multiple responses (e.g., n_samples: 4) from the same prompt by caching common prefixes blog/AReaL_v0_2.md60-67
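A rough way to see the benefit is to count how many prompt tokens must be prefilled when several responses share one prompt. The sketch below is idealized arithmetic, not a model of SGLang's internals: prefill_tokens is a hypothetical helper, the 512-token prompt is made up, and decode cost and cache eviction are ignored.

```python
def prefill_tokens(prompt_len: int, n_samples: int, prefix_cached: bool) -> int:
    """Count prompt tokens that must be prefilled to sample `n_samples` responses."""
    if prefix_cached:
        # The shared prompt is processed once and its cache is reused.
        return prompt_len
    # Without prefix caching, every sample re-processes the full prompt.
    return prompt_len * n_samples


prompt_len, n_samples = 512, 4
without_cache = prefill_tokens(prompt_len, n_samples, prefix_cached=False)
with_cache = prefill_tokens(prompt_len, n_samples, prefix_cached=True)
print(without_cache, with_cache)  # 2048 512
```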

Variable-Length Packing

To eliminate padding overhead, AReaL packs sequences into 1D tensors. A dynamic allocation algorithm distributes these sequences under a max_tokens_per_mb budget (e.g., 10240), maximizing GPU utilization blog/AReaL_v0_2.md69-76 examples/math/gsm8k_grpo.yaml51-53
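As an illustration of packing under a token budget, the sketch below groups sequences into micro-batches with a first-fit-decreasing heuristic. This is a stand-in technique chosen for brevity; AReaL's actual allocation algorithm may differ. pack_sequences and the sequence lengths are hypothetical.

```python
def pack_sequences(seq_lens: list[int], max_tokens_per_mb: int) -> list[list[int]]:
    """Group sequence indices into micro-batches whose total length fits the budget.

    First-fit decreasing: place the longest sequences first, each into the
    first micro-batch that still has room, opening a new one if none does.
    """
    order = sorted(range(len(seq_lens)), key=lambda i: seq_lens[i], reverse=True)
    bins: list[list[int]] = []  # sequence indices per micro-batch
    loads: list[int] = []  # total tokens per micro-batch
    for idx in order:
        length = seq_lens[idx]
        if length > max_tokens_per_mb:
            raise ValueError(f"sequence {idx} ({length} tokens) exceeds the budget")
        for b, load in enumerate(loads):
            if load + length <= max_tokens_per_mb:
                bins[b].append(idx)
                loads[b] += length
                break
        else:
            # No existing micro-batch has room: open a new one.
            bins.append([idx])
            loads.append(length)
    return bins


seq_lens = [6000, 3000, 4000, 7000, 1000, 2000]
micro_batches = pack_sequences(seq_lens, max_tokens_per_mb=10240)
for mb in micro_batches:
    print(mb, sum(seq_lens[i] for i in mb))
```

With these lengths the six sequences fit into three micro-batches holding 23,000 real tokens in total, whereas padding every sequence to the longest one (7,000 tokens) would process 42,000.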

Weight Update Mode

For high-performance scaling, use weight_update_mode: xccl. This mode moves data directly between the GPUs of the training and generation engines using NCCL with GPU-Direct RDMA (GDRDMA), avoiding CPU-mediated copies blog/AReaL_v0_2.md77-83 examples/math/gsm8k_grpo.yaml82

Sources: blog/AReaL_v0_2.md54-84 examples/math/gsm8k_grpo.yaml51-83

Expected Outputs

Logs and Monitoring

Execution logs and checkpoints are saved to the path specified in cluster.fileroot examples/math/gsm8k_grpo.yaml12. Monitoring is available via:

WandB: Enabled via stats_logger.wandb.mode: online examples/math/gsm8k_grpo.yaml170-175
Performance Tracing: Enabled via perf_tracer.enabled: true for detailed session analysis examples/math/gsm8k_grpo.yaml177-184

Training Speed

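In v0.2, AReaL achieved a 1.5x throughput improvement over v0.1 for 7B models blog/AReaL_v0_2.md13-15. Actual iteration times depend on your hardware and configuration; the prefix caching and sequence packing described under Performance Tuning are what keep GPU utilization high when sampling multiple responses per prompt.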

Sources: blog/AReaL_v0_2.md1-20 examples/math/gsm8k_grpo.yaml142-184

URL: https://deepwiki.com/inclusionAI/AReaL/1.3-quick-start-guide