
URL: https://deepwiki.com/inclusionAI/AReaL/3.1-trainengine-api

Last indexed: 7 May 2026 (2e12c1)

TrainEngine API

The TrainEngine API defines the abstract interface for training backends in AReaL. It provides a unified contract for model training, weight synchronization, distributed coordination, and integration with inference engines. This page documents the core interface, its implementations, and usage patterns.

For information about specific training backends (FSDP, Megatron, Archon), see pages 3.2, 3.3, and 3.4. For rollout integration and workflow execution, see page 5.3. For weight synchronization mechanisms, see page 3.6.


Interface Overview

The TrainEngine is an abstract base class that standardizes training operations across different distributed training frameworks. All implementations must provide methods for initialization, training, evaluation, checkpointing, and weight synchronization with inference engines.

Natural Language to Code Entity Space: Engine Hierarchy

The following diagram maps the conceptual "Training Engine" to the specific class hierarchy and their core responsibilities in the codebase.


Sources: areal/api/engine_api.py:32-264, areal/experimental/engine/archon_engine.py:147-200, areal/engine/fsdp_engine.py:218-250, areal/engine/megatron_engine.py:168-193
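
As a rough illustration of what an abstract training contract of this kind can look like, the sketch below declares the methods named on this page as abstract. The class name TrainEngineSketch and every parameter name are assumptions made for illustration; the real signatures live in areal/api/engine_api.py and may differ.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable


class TrainEngineSketch(ABC):
    """Hypothetical stand-in for the TrainEngine contract (signatures assumed)."""

    @abstractmethod
    def create_process_group(self, parallel_strategy: Any = None) -> None: ...

    @abstractmethod
    def initialize(self, ft_spec: Any) -> None: ...

    @abstractmethod
    def train_batch(self, batch: Any, loss_fn: Callable, loss_weight_fn: Callable) -> dict: ...

    @abstractmethod
    def update_weights(self, meta: Any) -> None: ...

    @abstractmethod
    def save(self, meta: Any) -> None: ...

    @abstractmethod
    def load(self, meta: Any) -> None: ...


class IncompleteEngine(TrainEngineSketch):
    """Implements only part of the contract, so it stays abstract."""

    def create_process_group(self, parallel_strategy: Any = None) -> None:
        pass

    def initialize(self, ft_spec: Any) -> None:
        pass


def can_instantiate(cls: type) -> bool:
    """Return True if cls() succeeds, False if abstract methods are missing."""
    try:
        cls()
        return True
    except TypeError:
        return False
```

Declaring the methods abstract means a backend that omits any of them fails at construction time rather than partway through a training run.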


Lifecycle Methods

Process Group Initialization

The create_process_group() method initializes distributed process groups based on the specified parallelism strategy. This must be called before initialize().


Key process groups created:

Sources: areal/api/engine_api.py:33-42, areal/experimental/engine/archon_engine.py:201-235, areal/engine/fsdp_engine.py:219-253, areal/engine/megatron_engine.py:195-230
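
The ordering requirement between create_process_group() and initialize() can be pictured with a toy class that tracks both steps. This is only a sketch: LifecycleSketch is a hypothetical name, and whether the real engines raise an error when called out of order is an assumption.

```python
class LifecycleSketch:
    """Toy engine that enforces create_process_group() before initialize()."""

    def __init__(self) -> None:
        self.process_group_ready = False
        self.initialized = False

    def create_process_group(self) -> None:
        # A real engine would set up distributed process groups here.
        self.process_group_ready = True

    def initialize(self) -> None:
        # Assumed behaviour: fail fast if the process groups do not exist yet.
        if not self.process_group_ready:
            raise RuntimeError(
                "create_process_group() must be called before initialize()"
            )
        self.initialized = True


def initialize_without_group_fails() -> bool:
    """Return True if calling initialize() first raises RuntimeError."""
    engine = LifecycleSketch()
    try:
        engine.initialize()
    except RuntimeError:
        return True
    return False


# Correct order: process groups first, then initialization.
engine = LifecycleSketch()
engine.create_process_group()
engine.initialize()
```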

Engine Initialization

The initialize() method loads the model, applies parallelism strategies, and creates optimizers. It accepts a FinetuneSpec (areal/api/io_struct.py:134-147) that specifies training hyperparameters.

Step | ArchonEngine | FSDPEngine | MegatronEngine
1. Model | PipelineStage creation | AutoModelForCausalLM | make_mcore_model
2. Parallelism | PipelinedRunner | parallelize_model() | DistributedDataParallel
3. Optimizer | create_optimizer() | AnyPrecisionAdamW | get_megatron_optimizer

Sources: areal/api/engine_api.py:44-57, areal/experimental/engine/archon_engine.py:237-330, areal/engine/fsdp_engine.py:255-353, areal/engine/megatron_engine.py:232-350
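
The three initialization steps in the table, and the role a finetuning spec could play in them, might be sketched as follows. FinetuneSpecSketch and InitOrderSketch are hypothetical names, and the three fields shown are assumptions about what such a spec could carry; check areal/api/io_struct.py for the actual definition.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FinetuneSpecSketch:
    # Hypothetical fields; the real FinetuneSpec may differ.
    total_train_epochs: int
    dataset_size: int
    train_batch_size: int

    @property
    def steps_per_epoch(self) -> int:
        # Incomplete trailing batches are dropped in this sketch.
        return self.dataset_size // self.train_batch_size

    @property
    def total_train_steps(self) -> int:
        return self.total_train_epochs * self.steps_per_epoch


@dataclass
class InitOrderSketch:
    log: List[str] = field(default_factory=list)
    scheduler_steps: int = 0

    def initialize(self, ft_spec: FinetuneSpecSketch) -> None:
        self.log.append("model")        # 1. build or load the model
        self.log.append("parallelism")  # 2. wrap or shard it
        self.log.append("optimizer")    # 3. create the optimizer
        # A learning-rate schedule typically needs the total step count.
        self.scheduler_steps = ft_spec.total_train_steps


spec = FinetuneSpecSketch(total_train_epochs=2, dataset_size=1000, train_batch_size=32)
engine = InitOrderSketch()
engine.initialize(spec)
```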


Training Methods

Micro-Batch Processing

All engines use a unified micro-batch splitting and processing pipeline. The MicroBatchList structure organizes data for gradient accumulation.


Stateless utilities in areal/engine/core/train_engine.py:

Sources: areal/engine/core/train_engine.py:1-145, areal/experimental/engine/archon_runner.py:30-51
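
One common way to split a batch into micro-batches for gradient accumulation is to group sequences greedily under a token budget. The sketch below illustrates that idea only; the function name and the max_tokens parameter are assumptions, and the page does not state which splitting policy MicroBatchList actually uses.

```python
from typing import List


def split_into_micro_batches(seq_lens: List[int], max_tokens: int) -> List[List[int]]:
    """Greedily group sequence indices so each group holds at most max_tokens tokens."""
    groups: List[List[int]] = []
    current: List[int] = []
    used = 0
    for idx, n_tokens in enumerate(seq_lens):
        if n_tokens > max_tokens:
            raise ValueError(f"sequence {idx} exceeds the token budget")
        # Close the current group when adding this sequence would overflow it.
        if current and used + n_tokens > max_tokens:
            groups.append(current)
            current, used = [], 0
        current.append(idx)
        used += n_tokens
    if current:
        groups.append(current)
    return groups


micro_batches = split_into_micro_batches([5, 3, 4, 6, 2], max_tokens=8)
```

Each inner list holds the indices of the sequences that would be processed together in one forward and backward pass.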

Training Batch

The train_batch() method performs a complete training step with gradient accumulation (areal/api/engine_api.py:239-251).


Training controllers such as LMEngine (areal/trainer/sft/lm_engine.py:29-34) and RWEngine (areal/trainer/rw/rw_engine.py:41-77) call this method with algorithm-specific loss functions such as compute_packed_sft_loss (areal/trainer/sft/lm_engine.py:79-130) or compute_rw_loss (areal/trainer/rw/rw_engine.py:115-146).

Sources: areal/api/engine_api.py:239-251, areal/trainer/sft/lm_engine.py:29-34, areal/trainer/rw/rw_engine.py:41-77
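
Gradient accumulation itself can be shown with a one-parameter model in plain Python. In this sketch the gradients from every micro-batch are summed, normalized by a total loss weight, and applied in a single update. The names train_batch_sketch, grad_fn, and loss_weight_fn are illustrative assumptions and do not reproduce the real train_batch() signature.

```python
from typing import Callable, List, Tuple

Sample = Tuple[float, float]  # (x, y) pair for the toy model y = w * x


def squared_error_grad(w: float, micro_batch: List[Sample]) -> float:
    """Summed gradient of (w * x - y) ** 2 with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in micro_batch)


def train_batch_sketch(
    w: float,
    micro_batches: List[List[Sample]],
    grad_fn: Callable[[float, List[Sample]], float],
    loss_weight_fn: Callable[[List[Sample]], float],
    lr: float,
) -> float:
    """Accumulate gradients over all micro-batches, then take one update step."""
    total_weight = sum(loss_weight_fn(mb) for mb in micro_batches)
    accumulated = 0.0
    for mb in micro_batches:
        # Accumulate only; the parameter is not updated inside the loop.
        accumulated += grad_fn(w, mb) / total_weight
    return w - lr * accumulated


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w_split = train_batch_sketch(0.0, [data[:2], data[2:]], squared_error_grad, len, lr=0.01)
w_full = train_batch_sketch(0.0, [data], squared_error_grad, len, lr=0.01)
```

Normalizing by the total weight across micro-batches, rather than per micro-batch, is what keeps the split update equal to the full-batch update in this toy setting.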


Inference Integration

Weight Synchronization

The update_weights() method synchronizes trained weights to inference engines (areal/api/engine_api.py:175-183).

Weight update modes:

Mode | Backend Method | Description
xccl | build_distributed_weight_update_requests() | NCCL/XCCL broadcast across training→inference ranks (areal/engine/sglang_remote.py:161-187)
disk | build_disk_weight_update_requests() | Save checkpoint to shared storage, inference loads (areal/engine/vllm_remote.py:129-148)

Natural Language to Code Entity Space: Weight Sync Flow


Sources: areal/api/engine_api.py:175-183, areal/engine/sglang_remote.py:161-187, areal/engine/vllm_remote.py:129-183, areal/experimental/engine/archon_weight_sync.py:114-176
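
The two update modes in the table can be pictured as a dispatch on a mode string. The sketch below is a simplified stand-in: the function names, the request dictionaries, and the JSON file format are all assumptions, and the xccl branch only describes a request rather than performing any collective communication.

```python
import json
import os
import tempfile
from typing import Dict, List, Optional

Weights = Dict[str, List[float]]


def update_weights_sketch(mode: str, weights: Weights, shared_dir: Optional[str] = None) -> dict:
    """Build a toy weight-update request for the given mode."""
    if mode == "disk":
        if shared_dir is None:
            raise ValueError("disk mode needs a shared directory")
        path = os.path.join(shared_dir, "weights.json")
        with open(path, "w") as f:
            json.dump(weights, f)
        return {"mode": "disk", "path": path}
    if mode == "xccl":
        # A real engine would broadcast tensors between ranks here.
        return {"mode": "xccl", "names": sorted(weights)}
    raise ValueError(f"unknown weight update mode: {mode}")


def inference_load(request: dict) -> Weights:
    """Toy inference side of disk mode: read weights back from shared storage."""
    with open(request["path"]) as f:
        return json.load(f)


def roundtrip_disk(weights: Weights) -> Weights:
    """Save via disk mode into a temporary directory and load the result."""
    with tempfile.TemporaryDirectory() as shared_dir:
        request = update_weights_sketch("disk", weights, shared_dir)
        return inference_load(request)


xccl_request = update_weights_sketch("xccl", {"b": [0.5], "a": [1.0]})
```

Disk mode trades transfer speed for simplicity, since it only needs storage visible to both sides, while the broadcast mode avoids the filesystem at the cost of a shared communication group.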


Checkpointing

Save and Load

The save() and load() methods manage model and optimizer state persistence (areal/api/engine_api.py:253-264).

Archon Checkpoint Implementation:

Sources: areal/api/engine_api.py:253-264, areal/experimental/engine/archon_checkpoint.py:17-54, areal/experimental/engine/archon_weight_sync.py:95-110
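
Why optimizer state is persisted alongside the model can be seen in a toy engine that uses SGD with momentum. In this sketch, resuming with the optimizer state reproduces the next step exactly, while resuming without it does not. ToyEngine, the with_optim flag, and the JSON checkpoint format are assumptions for illustration only.

```python
import json
import os
import tempfile


class ToyEngine:
    """One-parameter model trained by SGD with momentum."""

    def __init__(self) -> None:
        self.model = {"w": 0.0}
        self.optim = {"momentum": 0.0}

    def step(self, grad: float, lr: float = 0.1, beta: float = 0.9) -> None:
        self.optim["momentum"] = beta * self.optim["momentum"] + grad
        self.model["w"] -= lr * self.optim["momentum"]

    def save(self, path: str, with_optim: bool = True) -> None:
        state = {"model": self.model}
        if with_optim:
            state["optim"] = self.optim
        with open(path, "w") as f:
            json.dump(state, f)

    def load(self, path: str, with_optim: bool = True) -> None:
        with open(path) as f:
            state = json.load(f)
        self.model = state["model"]
        if with_optim and "optim" in state:
            self.optim = state["optim"]


def resumed_step_matches(with_optim: bool) -> bool:
    """Check whether a resumed engine takes the same next step as the original."""
    with tempfile.TemporaryDirectory() as ckpt_dir:
        path = os.path.join(ckpt_dir, "ckpt.json")
        original = ToyEngine()
        original.step(1.0)
        original.save(path, with_optim=True)
        resumed = ToyEngine()
        resumed.load(path, with_optim=with_optim)
        original.step(1.0)
        resumed.step(1.0)
        return original.model == resumed.model
```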


Advanced Features

Tree Training

Tree training enables training on trajectory trees (multiple branches per node) using specialized attention mechanisms.

Key components:

Sources: areal/experimental/engine/archon_runner.py:81-105, areal/models/tree_attn/module.py:102-105
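
The page does not spell out the attention mechanism, but a common way to attend over a tree of tokens is to let each node see only itself and its ancestors, so sibling branches stay independent. The sketch below builds such an ancestor mask from a parent array; the function name and the parent-array encoding are assumptions and are not taken from areal/models/tree_attn.

```python
from typing import List


def tree_attention_mask(parents: List[int]) -> List[List[bool]]:
    """Build an ancestor mask for a tree given as a parent array.

    parents[i] is the index of node i's parent, or -1 for a root.
    mask[i][j] is True when node i may attend to node j, that is,
    when j is i itself or one of i's ancestors.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:
            mask[i][j] = True
            j = parents[j]
    return mask


# Node 0 is the root, node 1 its child, nodes 2 and 3 are sibling branches of 1.
mask = tree_attention_mask([-1, 0, 1, 1])
```

With this mask the shared prefix (nodes 0 and 1) is visible to both branches, while nodes 2 and 3 cannot see each other.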

Memory Offloading

The training engines support moving model parameters to CPU during inference phases to save GPU memory. ArchonEngine manages this state during weight synchronization by checking tensor devices and moving tensors to the appropriate platform device type (areal/experimental/engine/archon_weight_sync.py:108-110).

Sources: areal/experimental/engine/archon_weight_sync.py:95-110, areal/utils/offload.py:131-132
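
The check-then-move pattern described above can be sketched with device-tagged stand-ins for tensors. ToyParam and move_params are hypothetical names, and the device strings "cpu" and "cuda" are placeholders for whatever the platform reports; no real tensors are moved here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyParam:
    name: str
    device: str


def move_params(params: List[ToyParam], target: str) -> int:
    """Move every parameter not already on target; return how many moved."""
    moved = 0
    for param in params:
        # Check the current device first so resident parameters are skipped.
        if param.device != target:
            param.device = target
            moved += 1
    return moved


params = [ToyParam("embed", "cuda"), ToyParam("lm_head", "cuda")]
offloaded = move_params(params, "cpu")       # free accelerator memory for inference
restored = move_params(params, "cuda")       # bring weights back before syncing
already_there = move_params(params, "cuda")  # nothing left to move
```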