Last indexed: 7 May 2026 (2e12c1)

Trainer Orchestration

Purpose and Scope

This document describes the orchestration classes in AReaL: SFTTrainer, PPOTrainer, RWTrainer, and DPOTrainer. These classes are the top-level controllers for the complete training lifecycle, coordinating model updates, evaluation, checkpointing, and recovery across distributed resources. They connect high-level configuration and dataset management to the low-level distributed execution of the training and inference engines.

Related pages:

For RL algorithm implementations (PPO, GRPO, etc.), see pages 7.2 and 7.3
For asynchronous training mechanics, see page 7.5
For rollout coordination details, see page 5.3
For training engine APIs, see page 3.1

Trainer Overview

AReaL provides several trainer classes that orchestrate different training paradigms. Each trainer is responsible for initializing its respective engines (Actor, Critic, Reference, etc.) and managing the data flow between them.

Trainer	Purpose	Training Mode	Key Components
`SFTTrainer`	Supervised fine-tuning	Single-phase (training only)	`Actor` engine, `StatefulDataLoader`
`PPOTrainer`	Reinforcement learning (PPO/GRPO)	Two-phase (rollout + training)	`Actor`, `Ref`, `Critic` engines, `RolloutController`
`RWTrainer`	Reward Model training	Single-phase (preference pairs)	`Actor` engine, `rw_modeling_collate_fn`
`DPOTrainer`	Direct Preference Optimization	Single-phase (preference pairs)	`Actor`, `Ref` engines, `dpo_modeling_collate_fn`

Architecture Diagram: Trainer Hierarchy

Key architectural differences:

SFTTrainer: Follows a traditional supervised learning pattern where batches are loaded from a dataset and directly used for training via train_lm. areal/trainer/sft_trainer.py191
PPOTrainer: Implements a two-phase RL loop where rollouts are generated via inference engines, then used to train the policy via train_rl. areal/trainer/rl_trainer.py328
RWTrainer/DPOTrainer: Designed for preference modeling. RWTrainer handles chosen/rejected pairs for reward modeling. areal/trainer/rw_trainer.py54-73 DPOTrainer manages both an Actor and a Reference model to compute the DPO loss. areal/trainer/dpo_trainer.py112-116

Sources: areal/trainer/sft_trainer.py54-147 areal/trainer/rl_trainer.py102-243 areal/trainer/rw_trainer.py76-180 areal/trainer/dpo_trainer.py84-189

SFTTrainer Architecture

The SFTTrainer class provides a streamlined orchestrator for supervised fine-tuning. It manages a single training engine and executes a standard training loop over labeled datasets.

Initialization

The constructor initializes the training environment and distributed components areal/trainer/sft_trainer.py55-147:

File logging setup: Calls logging.setup_file_logging() when is_single_controller() returns True. areal/trainer/sft_trainer.py62-63
Seed initialization: Sets random seed via seeding.set_random_seed(). areal/trainer/sft_trainer.py76
Tokenizer loading: Loads tokenizer/processor via load_hf_processor_and_tokenizer(). areal/trainer/sft_trainer.py66-68
Scheduler creation: Initializes LocalScheduler, RayScheduler, or SlurmScheduler based on configuration. areal/trainer/sft_trainer.py69-71
Actor engine creation: Instantiates the backend engine (FSDP, Megatron, or Archon) and initializes it with FinetuneSpec. areal/trainer/sft_trainer.py80-112
DataLoader setup: Creates StatefulDataLoader instances via create_dataloader(). areal/trainer/sft_trainer.py98-103
Utility initialization: Sets up Evaluator, Saver, RecoverHandler, and StatsLogger. areal/trainer/sft_trainer.py135-138
Recovery: Loads checkpoint via recover_handler.load(). areal/trainer/sft_trainer.py139-145

Training Loop

The SFTTrainer.train() method areal/trainer/sft_trainer.py148-237 executes a nested loop over epochs and the steps within each epoch.
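To make the loop shape concrete, here is a minimal sketch of an epoch-step loop that can also resume from a saved global step. It is not the SFTTrainer code: run_epoch_step_loop, train_step, and start_step are hypothetical names, and the real method also handles evaluation, saving, and logging.

```python
def run_epoch_step_loop(total_epochs, steps_per_epoch, train_step, start_step=0):
    """Run train_step for every (epoch, step) pair at or after start_step."""
    executed = []
    # Translate the flat resume point into an epoch and an offset inside it.
    start_epoch, start_in_epoch = divmod(start_step, steps_per_epoch)
    for epoch in range(start_epoch, total_epochs):
        # Only the first resumed epoch starts partway through.
        first = start_in_epoch if epoch == start_epoch else 0
        for epoch_step in range(first, steps_per_epoch):
            global_step = epoch * steps_per_epoch + epoch_step
            train_step(epoch, epoch_step, global_step)
            executed.append(global_step)
    return executed


calls = []
fresh = run_epoch_step_loop(2, 3, lambda e, s, g: calls.append((e, s, g)))
resumed = run_epoch_step_loop(2, 3, lambda e, s, g: None, start_step=4)
```

A fresh run visits all six steps, while the resumed run visits only the steps from the saved position onward.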

Sources: areal/trainer/sft_trainer.py54-408 areal/trainer/sft/lm_engine.py52-64

PPOTrainer Architecture

The PPOTrainer class implements the orchestration logic for RL algorithms. It manages actor, critic, and reference engines along with a pool of remote inference engines. areal/trainer/rl_trainer.py102-144

PPOTrainer Initialization

The PPOTrainer coordinates the lifecycle of multiple distributed components:

Scheduler: Manages remote worker allocation. areal/trainer/rl_trainer.py120
Inference engines: Created via _create_rollout_controller() to manage remote RemoteSGLangEngine or RemotevLLMEngine instances. areal/trainer/rl_trainer.py202
Training engines: Actor (policy), reference model (optional), critic (PPO only). areal/trainer/rl_trainer.py168-186
Rollout coordinator: RolloutController manages asynchronous rollout collection and capacity control. areal/infra/controller/rollout_controller.py70-128
DataLoader: Supports online mode via _EmptyDataLoader if no training dataset is provided. areal/trainer/rl_trainer.py75-101

Sources: areal/trainer/rl_trainer.py75-202 areal/infra/controller/rollout_controller.py70-128

RWTrainer and DPOTrainer Architecture

These trainers are specialized for preference learning, using specific collation functions to prepare data.

Preference Data Collation

RWTrainer: rw_modeling_collate_fn produces two dicts per item (chosen first, then rejected) with input_ids and attention_mask. areal/trainer/rw_trainer.py54-73
DPOTrainer: dpo_modeling_collate_fn produces similar pairs but also includes a loss_mask for both chosen and rejected sequences. areal/trainer/dpo_trainer.py54-81
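The chosen-then-rejected layout described above can be sketched as follows. This is an illustration only, not the AReaL collate functions: pairwise_collate and the item keys prompt_ids, chosen_ids, and rejected_ids are hypothetical, plain lists stand in for tensors, and the loss mask marking only response tokens is an assumed convention.

```python
def pairwise_collate(items, with_loss_mask=False):
    """Flatten preference items into [chosen_0, rejected_0, chosen_1, ...]."""
    batch = []
    for item in items:
        prompt = list(item["prompt_ids"])
        # Chosen always precedes rejected so pairs sit at adjacent indices.
        for key in ("chosen_ids", "rejected_ids"):
            response = list(item[key])
            ids = prompt + response
            row = {"input_ids": ids, "attention_mask": [1] * len(ids)}
            if with_loss_mask:
                # Assumption: only response tokens contribute to the loss.
                row["loss_mask"] = [0] * len(prompt) + [1] * len(response)
            batch.append(row)
    return batch


items = [{"prompt_ids": [1, 2], "chosen_ids": [3, 4, 5], "rejected_ids": [6]}]
rw_rows = pairwise_collate(items)
dpo_rows = pairwise_collate(items, with_loss_mask=True)
```

Keeping each pair adjacent means downstream code can recover the chosen and rejected halves by taking even and odd positions.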

Training Loops

Both trainers use a loop structure similar to SFTTrainer's:

RWTrainer: Calls actor.train_rw(batch) to perform reward model optimization. areal/trainer/rw_trainer.py224
DPOTrainer: Orchestrates both actor and ref engines. The actor.train_dpo(batch, ref=self.ref) call coordinates the computation of log-probabilities from both models to compute the DPO loss. areal/trainer/dpo_trainer.py221
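For reference, the standard DPO objective for one preference pair can be written as a short function over four sequence log-probabilities. This is a sketch of the published formula, not AReaL's train_dpo: the function name, the beta default, and the use of scalar inputs are assumptions, and the real implementation may differ in reduction and masking.

```python
import math


def dpo_pair_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Return -log(sigmoid(beta * margin)) for one chosen/rejected pair."""
    # How much more the policy favors chosen over rejected than the reference does.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    x = beta * margin
    # -log(sigmoid(x)) equals softplus(-x); this form avoids overflow for large |x|.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


neutral = dpo_pair_loss(-5.0, -7.0, -5.0, -7.0)   # policy identical to reference
improved = dpo_pair_loss(-4.0, -8.0, -5.0, -7.0)  # policy prefers chosen more strongly
```

When the policy matches the reference the loss is log 2, and it falls below that as the policy separates chosen from rejected more than the reference does.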

Sources: areal/trainer/rw_trainer.py54-270 areal/trainer/dpo_trainer.py54-271

Data Distribution and Control

The TrainController is the base class for managing distributed training across multiple workers. It handles data splitting across data-parallel (DP) groups.

Data Dispatch Logic

The _dispatch_tensors function partitions trajectories across DP groups using a balanced greedy algorithm:

It calculates token_weights for each item. areal/infra/controller/train_controller.py95
It uses balanced_greedy_partition to ensure DP ranks receive roughly equal token counts. areal/infra/controller/train_controller.py103
It supports group_size to keep atomic units (like chosen/rejected pairs) together on the same rank. areal/infra/controller/train_controller.py79-88
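The dispatch steps above can be approximated with a small greedy partitioner. This is a sketch under stated assumptions, not the balanced_greedy_partition used by AReaL: partition_by_tokens is a hypothetical name, groups are assumed to be consecutive runs of group_size items, and the real function may add constraints such as equal item counts per rank.

```python
import heapq


def partition_by_tokens(token_weights, n_ranks, group_size=1):
    """Assign item indices to ranks, heaviest group first, to the lightest rank."""
    if len(token_weights) % group_size != 0:
        raise ValueError("number of items must be a multiple of group_size")
    # Consecutive items form one atomic group (for example a chosen/rejected pair).
    groups = [list(range(i, i + group_size))
              for i in range(0, len(token_weights), group_size)]
    group_weight = [sum(token_weights[j] for j in g) for g in groups]
    order = sorted(range(len(groups)), key=lambda g: -group_weight[g])

    heap = [(0, rank) for rank in range(n_ranks)]  # (current load, rank)
    assignment = [[] for _ in range(n_ranks)]
    for g in order:
        load, rank = heapq.heappop(heap)  # rank with the smallest load so far
        assignment[rank].extend(groups[g])
        heapq.heappush(heap, (load + group_weight[g], rank))
    return assignment


weights = [10, 12, 3, 4, 8, 9, 1, 2]
per_rank = partition_by_tokens(weights, n_ranks=2, group_size=2)
```

With these weights the two ranks end up with 25 and 24 tokens, and every pair of adjacent items stays on a single rank.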

Training Loop Architectures

Sources: areal/infra/controller/train_controller.py76-123 areal/trainer/rl_trainer.py245-385

Weight Synchronization

After training steps, the trainer synchronizes updated actor weights to inference engines to maintain policy freshness.

Key Code Entities:

WeightUpdateMeta: Contains path or NCCL/XCCL group info. areal/api/io_struct.py21-34
actor.prepare_weight_update(): Generates the update metadata. areal/trainer/rl_trainer.py365
RolloutController.update_weights(): Propagates the metadata to all inference workers. areal/infra/controller/rollout_controller.py251-267
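The flow between these entities can be sketched with stand-in classes. Nothing here is the AReaL API: UpdateMetaSketch, InferenceWorkerStub, and sync_weights are hypothetical, the kind labels are invented for illustration, and the version counter is an assumption about how freshness might be tracked.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UpdateMetaSketch:
    """Hypothetical update metadata: a transport kind plus where to find weights."""
    kind: str                         # "disk" or "collective" (assumed labels)
    path: Optional[str] = None        # checkpoint directory for disk-based updates
    group_name: Optional[str] = None  # communicator group for NCCL/XCCL-style updates
    version: int = 0                  # assumed policy version counter


class InferenceWorkerStub:
    """Stands in for a remote inference engine; it only records the version."""

    def __init__(self):
        self.version = 0

    def apply(self, meta):
        self.version = meta.version


def sync_weights(meta, workers):
    """Send the same metadata to every worker and return the versions they hold."""
    for worker in workers:
        worker.apply(meta)
    return [worker.version for worker in workers]


workers = [InferenceWorkerStub() for _ in range(3)]
versions = sync_weights(UpdateMetaSketch(kind="disk", path="/tmp/ckpt", version=5), workers)
```

The point of the sketch is that a single metadata object is produced on the training side and fanned out unchanged to every inference worker.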

Sources: areal/trainer/rl_trainer.py365-375 areal/infra/controller/rollout_controller.py251-267

Recovery and Checkpointing

The RecoverHandler areal/utils/recover.py146-348 manages state restoration. It ensures that model weights, dataloader states, and utility controller states are preserved.

Recovery Logic:

All trainers initialize a RecoverHandler during __init__. areal/trainer/sft_trainer.py137 areal/trainer/rl_trainer.py199 areal/trainer/rw_trainer.py165 areal/trainer/dpo_trainer.py175
The handler looks for a recover_info directory containing step_info.json and dataloader_info.pkl. areal/utils/recover.py221-276
Training resumes from the exact global_step and dataset offset stored in the checkpoint. areal/trainer/sft_trainer.py150-154
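The recovery logic above can be illustrated with a small round trip through the two files. Only the directory and file names come from this page; dump_recover_info, load_recover_info, and the contents of both files are hypothetical, and the real RecoverHandler also restores model weights and other state.

```python
import json
import pickle
import tempfile
from pathlib import Path


def dump_recover_info(root, step_info, dataloader_state):
    """Write the step counters and dataloader state under root/recover_info."""
    directory = Path(root) / "recover_info"
    directory.mkdir(parents=True, exist_ok=True)
    (directory / "step_info.json").write_text(json.dumps(step_info))
    with open(directory / "dataloader_info.pkl", "wb") as f:
        pickle.dump(dataloader_state, f)


def load_recover_info(root):
    """Return (step_info, dataloader_state), or None if there is nothing to resume."""
    directory = Path(root) / "recover_info"
    step_file = directory / "step_info.json"
    loader_file = directory / "dataloader_info.pkl"
    if not (step_file.exists() and loader_file.exists()):
        return None  # fresh run: no recover_info to restore
    step_info = json.loads(step_file.read_text())
    with open(loader_file, "rb") as f:
        return step_info, pickle.load(f)


with tempfile.TemporaryDirectory() as root:
    before = load_recover_info(root)
    dump_recover_info(root, {"global_step": 42}, {"samples_seen": 1344})
    after = load_recover_info(root)
```

Returning None for a missing directory lets the caller treat a fresh start and a resume through the same code path.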

Sources: areal/utils/recover.py146-348 areal/trainer/sft_trainer.py137-154 areal/trainer/dpo_trainer.py175-187


URL: https://deepwiki.com/inclusionAI/AReaL/7.4-trainer-orchestration