
URL: https://deepwiki.com/inclusionAI/AReaL/7.4-trainer-orchestration



Last indexed: 7 May 2026 (2e12c1)

Trainer Orchestration

Purpose and Scope

This document describes the trainer orchestration classes in AReaL: SFTTrainer, PPOTrainer, RWTrainer, and DPOTrainer. These classes are the top-level controllers for the complete training lifecycle, coordinating model updates, evaluation, checkpointing, and recovery across distributed resources. They connect high-level configuration and dataset management to the low-level distributed execution of the training and inference engines.

Related pages:

  • For RL algorithm implementations (PPO, GRPO, etc.), see pages 7.2 and 7.3
  • For asynchronous training mechanics, see page 7.5
  • For rollout coordination details, see page 5.3
  • For training engine APIs, see page 3.1

Trainer Overview

AReaL provides several trainer classes that orchestrate different training paradigms. Each trainer is responsible for initializing its respective engines (Actor, Critic, Reference, etc.) and managing the data flow between them.

| Trainer | Purpose | Training Mode | Key Components |
|---|---|---|---|
| SFTTrainer | Supervised fine-tuning | Single-phase (training only) | Actor engine, StatefulDataLoader |
| PPOTrainer | Reinforcement learning (PPO/GRPO) | Two-phase (rollout + training) | Actor, Ref, Critic engines, RolloutController |
| RWTrainer | Reward Model training | Single-phase (preference pairs) | Actor engine, rw_modeling_collate_fn |
| DPOTrainer | Direct Preference Optimization | Single-phase (preference pairs) | Actor, Ref engines, dpo_modeling_collate_fn |

Architecture Diagram: Trainer Hierarchy


Key architectural differences:

  1. SFTTrainer: Follows a traditional supervised learning pattern where batches are loaded from a dataset and directly used for training via train_lm. areal/trainer/sft_trainer.py191
  2. PPOTrainer: Implements a two-phase RL loop where rollouts are generated via inference engines, then used to train the policy via train_rl. areal/trainer/rl_trainer.py328
  3. RWTrainer/DPOTrainer: Designed for preference modeling. RWTrainer handles chosen/rejected pairs for reward modeling areal/trainer/rw_trainer.py54-73 while DPOTrainer manages both an Actor and a Reference model to compute DPO loss. areal/trainer/dpo_trainer.py112-116

Sources: areal/trainer/sft_trainer.py54-147 areal/trainer/rl_trainer.py102-243 areal/trainer/rw_trainer.py76-180 areal/trainer/dpo_trainer.py84-189


SFTTrainer Architecture

The SFTTrainer class provides a streamlined orchestrator for supervised fine-tuning. It manages a single training engine and executes a standard training loop over labeled datasets.

Initialization

The constructor initializes the training environment and distributed components areal/trainer/sft_trainer.py55-147:

  1. File logging setup: Calls logging.setup_file_logging() if is_single_controller() is True. areal/trainer/sft_trainer.py62-63
  2. Seed initialization: Sets random seed via seeding.set_random_seed(). areal/trainer/sft_trainer.py76
  3. Tokenizer loading: Loads tokenizer/processor via load_hf_processor_and_tokenizer(). areal/trainer/sft_trainer.py66-68
  4. Scheduler creation: Initializes LocalScheduler, RayScheduler, or SlurmScheduler based on configuration. areal/trainer/sft_trainer.py69-71
  5. Actor engine creation: Instantiates the backend engine (FSDP, Megatron, or Archon) and initializes it with FinetuneSpec. areal/trainer/sft_trainer.py80-112
  6. DataLoader setup: Creates StatefulDataLoader instances via create_dataloader(). areal/trainer/sft_trainer.py98-103
  7. Utility initialization: Sets up Evaluator, Saver, RecoverHandler, and StatsLogger. areal/trainer/sft_trainer.py135-138
  8. Recovery: Loads checkpoint via recover_handler.load(). areal/trainer/sft_trainer.py139-145

Training Loop

The SFTTrainer.train() method areal/trainer/sft_trainer.py148-237 executes an epoch-step loop:
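As a rough illustration of what an epoch-step loop looks like, the sketch below iterates a flat global step, derives the epoch and in-epoch step from it, trains on one batch, and records a checkpoint marker at a fixed frequency. It is not the AReaL implementation: `ToyEngine`, `ToyTrainer`, `save_every`, and `start_step` are hypothetical names, and only the `train_lm` call name comes from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEngine:
    """Hypothetical stand-in for the actor engine; it only counts updates."""
    updates: int = 0

    def train_lm(self, batch):
        self.updates += 1
        return {"loss": 1.0 / self.updates, "n_samples": len(batch)}


@dataclass
class ToyTrainer:
    engine: ToyEngine
    batches: list                 # stand-in for a StatefulDataLoader
    total_epochs: int = 2
    save_every: int = 2           # hypothetical checkpoint frequency (in steps)
    start_step: int = 0           # nonzero when resuming after a recovery
    log: list = field(default_factory=list)

    def train(self):
        steps_per_epoch = len(self.batches)
        total_steps = self.total_epochs * steps_per_epoch
        for global_step in range(self.start_step, total_steps):
            # Recover (epoch, step) from the flat counter so a resumed run
            # continues from the same position in the dataset.
            epoch, step = divmod(global_step, steps_per_epoch)
            stats = self.engine.train_lm(self.batches[step])
            if (global_step + 1) % self.save_every == 0:
                self.log.append(("save", global_step))
            self.log.append(("stats", epoch, step, stats["loss"]))
        return self.log
```

Passing `start_step=4` with three batches per epoch resumes at epoch 1, step 1, which is the kind of position a recovery handler would need to restore.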


Sources: areal/trainer/sft_trainer.py54-408 areal/trainer/sft/lm_engine.py52-64


PPOTrainer Architecture

The PPOTrainer class implements the orchestration logic for RL algorithms. It manages actor, critic, and reference engines along with a pool of remote inference engines. areal/trainer/rl_trainer.py102-144

PPOTrainer Initialization

The PPOTrainer coordinates the lifecycle of multiple distributed components:

  1. Scheduler: Manages remote worker allocation. areal/trainer/rl_trainer.py120
  2. Inference engines: Created via _create_rollout_controller(), which manages RemoteSGLangEngine or RemotevLLMEngine instances. areal/trainer/rl_trainer.py202
  3. Training engines: Actor (policy), reference model (optional), critic (PPO only). areal/trainer/rl_trainer.py168-186
  4. Rollout coordinator: RolloutController manages asynchronous rollout collection and capacity control. areal/infra/controller/rollout_controller.py70-128
  5. DataLoader: Supports online mode via _EmptyDataLoader if no training dataset is provided. areal/trainer/rl_trainer.py75-101
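The way these components interact in a two-phase step can be sketched with toy objects. The example below is a simplified illustration under stated assumptions, not AReaL code: `PlaceholderLoader`, `ToyRolloutController`, `ToyActor`, and `rl_loop` are hypothetical, and the idea that an online-mode loader yields empty batches is an assumption about how a loader like _EmptyDataLoader might behave. Only the `train_rl` name is taken from the text.

```python
import itertools


class PlaceholderLoader:
    """Hypothetical online-mode loader: yields empty batches forever."""

    def __iter__(self):
        return itertools.repeat({})


class ToyRolloutController:
    """Hypothetical stand-in for a rollout controller."""

    def __init__(self, group_size=2):
        self.group_size = group_size
        self.calls = 0

    def rollout_batch(self, batch):
        self.calls += 1
        # With no dataset prompts, pretend the environment supplies one.
        prompts = batch.get("prompts") or [f"env-prompt-{self.calls}"]
        return [
            {"prompt": p, "sample": k, "reward": float(k)}
            for p in prompts
            for k in range(self.group_size)
        ]


class ToyActor:
    """Hypothetical stand-in for the actor training engine."""

    def __init__(self):
        self.steps = 0

    def train_rl(self, trajectories):
        self.steps += 1
        mean = sum(t["reward"] for t in trajectories) / len(trajectories)
        return {"step": self.steps, "mean_reward": mean, "n": len(trajectories)}


def rl_loop(actor, rollout, loader, max_steps):
    history = []
    # range() comes first so zip stops without pulling an extra batch.
    for _, batch in zip(range(max_steps), loader):
        trajectories = rollout.rollout_batch(batch)   # phase 1: rollout
        history.append(actor.train_rl(trajectories))  # phase 2: training
    return history
```

The same loop runs with a finite list of prompt batches in place of `PlaceholderLoader`; it then stops when the dataset is exhausted rather than at `max_steps`.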

Sources: areal/trainer/rl_trainer.py75-202 areal/infra/controller/rollout_controller.py70-128


RWTrainer and DPOTrainer Architecture

These trainers are specialized for preference learning, using specific collation functions to prepare data.

Preference Data Collation
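A preference collation function typically turns a list of chosen/rejected pairs into one padded batch. The sketch below shows one common layout, interleaving chosen and rejected sequences so that rows 2i and 2i+1 belong to the same pair. It is an assumption for illustration only: the field names `chosen_ids` and `rejected_ids`, the `pad_id` parameter, and the interleaved layout are hypothetical and are not taken from rw_modeling_collate_fn or dpo_modeling_collate_fn.

```python
def collate_preference_pairs(examples, pad_id=0):
    """Interleave chosen/rejected token lists and right-pad to one length."""
    sequences = []
    for ex in examples:
        sequences.append(ex["chosen_ids"])    # row 2i
        sequences.append(ex["rejected_ids"])  # row 2i + 1
    max_len = max(len(s) for s in sequences)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    # 1 marks real tokens, 0 marks padding.
    attention_mask = [
        [1] * len(s) + [0] * (max_len - len(s)) for s in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = collate_preference_pairs(
    [
        {"chosen_ids": [5, 6, 7], "rejected_ids": [5, 9]},
        {"chosen_ids": [3], "rejected_ids": [3, 4, 4, 4]},
    ]
)
```

Keeping the pair adjacent in the batch lets a loss function recover chosen and rejected scores with simple even/odd indexing.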

Training Loops

Both trainers use a loop structure similar to that of SFTTrainer:

  • RWTrainer: Calls actor.train_rw(batch) to perform reward model optimization. areal/trainer/rw_trainer.py224
  • DPOTrainer: Orchestrates both actor and ref engines. The actor.train_dpo(batch, ref=self.ref) call coordinates the computation of log-probabilities from both models to compute the DPO loss. areal/trainer/dpo_trainer.py221
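To make the role of the two models concrete, the sketch below computes the standard per-pair DPO loss from sequence log-probabilities, along with the pairwise (Bradley-Terry) loss commonly used for reward models. These are the textbook formulas and are assumed here for illustration; the source does not confirm the exact loss, the `beta` default, or these function names, all of which are hypothetical.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair of sequence log-probabilities."""
    # How much more the policy prefers "chosen" than the reference does.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


def pairwise_reward_loss(r_chosen, r_rejected):
    """Bradley-Terry style loss on scalar rewards for one pair."""
    return -log_sigmoid(r_chosen - r_rejected)
```

When the policy matches the reference, the margin is zero and the DPO loss equals log 2; the loss falls as the policy shifts probability toward the chosen response relative to the reference.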

Sources: areal/trainer/rw_trainer.py54-270 areal/trainer/dpo_trainer.py54-271


Data Distribution and Control

The TrainController is the base class for managing distributed training across multiple workers. It handles data splitting across data-parallel (DP) groups.

Data Dispatch Logic

The _dispatch_tensors function partitions trajectories across DP groups using a balanced greedy algorithm:
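A balanced greedy partition can be sketched as follows: sort the items by size in descending order, then repeatedly assign the next item to the group with the smallest current load. This is the generic longest-processing-time heuristic, shown only to illustrate the idea; whether _dispatch_tensors uses exactly this ordering or tie-breaking is not confirmed by the source, and `balanced_partition` is a hypothetical name.

```python
import heapq


def balanced_partition(lengths, n_groups):
    """Assign item indices to n_groups so that total lengths stay balanced."""
    # Min-heap of (current_load, group_id); the lightest group is popped first.
    heap = [(0, g) for g in range(n_groups)]
    heapq.heapify(heap)
    groups = [[] for _ in range(n_groups)]
    # Place the longest sequences first so short ones can fill the gaps.
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        load, g = heapq.heappop(heap)
        groups[g].append(idx)
        heapq.heappush(heap, (load + lengths[idx], g))
    return groups


groups = balanced_partition([8, 7, 6, 5, 4], n_groups=2)
loads = [sum([8, 7, 6, 5, 4][i] for i in g) for g in groups]
```

With this heuristic, the gap between the heaviest and lightest group never exceeds the size of the largest single item, which keeps per-rank token counts close when sequence lengths vary.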

Training Loop Architectures


Sources: areal/infra/controller/train_controller.py76-123 areal/trainer/rl_trainer.py245-385


Weight Synchronization

After each training step, the trainer synchronizes the updated actor weights to the inference engines so that subsequent rollouts are generated by the current policy.
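One plausible shape for such a synchronization step is pause, push, tag with a version, resume. The sketch below illustrates that sequence with toy objects; the pause/resume behavior, the version counter, and the names `ToyInferenceEngine` and `sync_weights` are assumptions made for illustration and are not drawn from the AReaL code.

```python
class ToyInferenceEngine:
    """Hypothetical stand-in for a remote inference engine."""

    def __init__(self):
        self.weights = {}
        self.version = -1
        self.paused = False


def sync_weights(actor_weights, version, engines):
    """Push a snapshot of the actor weights to every inference engine."""
    # Pause all engines first so none serves requests mid-update.
    for engine in engines:
        engine.paused = True
    for engine in engines:
        engine.weights = dict(actor_weights)  # copy, not a shared reference
        engine.version = version              # lets rollouts be tagged by policy version
    for engine in engines:
        engine.paused = False
```

Tagging each engine with a version number is one way to let the trainer measure how stale a trajectory is relative to the policy being updated.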


Key Code Entities:

Sources: areal/trainer/rl_trainer.py365-375 areal/infra/controller/rollout_controller.py251-267


Recovery and Checkpointing

The RecoverHandler areal/utils/recover.py146-348 manages state restoration. It ensures that model weights, dataloader states, and utility controller states are preserved.
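The bookkeeping side of recovery can be illustrated with a small save/load pair that persists the step counter and dataloader position. This is a minimal sketch under stated assumptions, not the RecoverHandler API: the function names, the JSON layout, and the `extras` field are hypothetical, and real model weights would be saved by the engine rather than in this file.

```python
import json
import os
import tempfile


def save_recover_state(path, global_step, dataloader_state, extras=None):
    """Atomically write the bookkeeping needed to resume a run."""
    state = {
        "global_step": global_step,
        "dataloader_state": dataloader_state,
        "extras": extras or {},  # e.g. saver/evaluator/logger counters
    }
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temp file, then rename, so a crash never leaves a
    # half-written recovery file behind.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def load_recover_state(path):
    """Return the saved state, or None when starting from scratch."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)
```

A trainer constructor could call `load_recover_state` once and fall back to step 0 when it returns `None`, which mirrors the load-on-init pattern described for the trainers above.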

Recovery Logic:

Sources: areal/utils/recover.py146-348 areal/trainer/sft_trainer.py137-154 areal/trainer/dpo_trainer.py175-187