Last indexed: 7 May 2026 (2e12c1)

TrainEngine API

The TrainEngine API defines the abstract interface for training backends in AReaL. It provides a unified contract for model training, weight synchronization, distributed coordination, and integration with inference engines. This page documents the core interface, its implementations, and usage patterns.

For information about specific training backends (FSDP, Megatron, Archon), see pages 3.2, 3.3, and 3.4. For rollout integration and workflow execution, see page 5.3. For weight synchronization mechanisms, see page 3.6.

Interface Overview

The TrainEngine is an abstract base class that standardizes training operations across different distributed training frameworks. All implementations must provide methods for initialization, training, evaluation, checkpointing, and weight synchronization with inference engines.
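As a rough illustration of how an abstract base class enforces such a contract, the sketch below declares one abstract method per responsibility named above. The class names `SketchTrainEngine` and `PartialEngine`, the helper `can_instantiate`, and every signature are hypothetical and chosen for this example; the actual definitions live in areal/api/engine_api.py and may differ.

```python
import abc


class SketchTrainEngine(abc.ABC):
    """Hypothetical contract: method names echo this page, signatures are assumed."""

    @abc.abstractmethod
    def create_process_group(self) -> None:
        """Set up distributed process groups."""

    @abc.abstractmethod
    def initialize(self) -> None:
        """Load the model and build the optimizer."""

    @abc.abstractmethod
    def train_batch(self, batch: list) -> dict:
        """Run one training step and return statistics."""

    @abc.abstractmethod
    def update_weights(self) -> None:
        """Push trained weights to the inference side."""

    @abc.abstractmethod
    def save(self, path: str) -> None:
        """Persist model and optimizer state."""

    @abc.abstractmethod
    def load(self, path: str) -> None:
        """Restore model and optimizer state."""


class PartialEngine(SketchTrainEngine):
    """Implements only part of the contract, so Python refuses to instantiate it."""

    def create_process_group(self) -> None:
        pass

    def initialize(self) -> None:
        pass


def can_instantiate(cls: type) -> bool:
    """Return True if cls() succeeds, False if abstract methods are missing."""
    try:
        cls()
    except TypeError:
        return False
    return True
```

With this pattern, a backend that forgets a required method fails when it is constructed rather than partway through a training run.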

Diagram: Natural Language to Code Entity Space: Engine Hierarchy

The diagram maps the conceptual "Training Engine" to the concrete class hierarchy in the codebase and to each class's core responsibilities.

Sources: areal/api/engine_api.py:32-264, areal/experimental/engine/archon_engine.py:147-200, areal/engine/fsdp_engine.py:218-250, areal/engine/megatron_engine.py:168-193

Lifecycle Methods

Process Group Initialization

The create_process_group() method initializes distributed process groups based on the specified parallelism strategy. This must be called before initialize().

Key process groups created:

Data parallel group: Ranks that hold different data shards (areal/api/engine_api.py:61-69)
Context and model parallel group: Combined group for sequence and model parallelism (areal/api/engine_api.py:119-127)
CPU group: Gloo backend for CPU barriers and weight offloading (areal/api/engine_api.py:131-139)
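To make the first two groups concrete, the following sketch computes which ranks would share a data parallel group and which would share a combined context/model parallel group. The function name `build_rank_groups` and the rank layout (ranks in the same context/model parallel block are contiguous) are assumptions made for this example; the ordering used by a given backend may differ.

```python
def build_rank_groups(
    world_size: int, dp_size: int
) -> tuple[list[list[int]], list[list[int]]]:
    """Return (data_parallel_groups, context_model_parallel_groups).

    Assumed layout: rank = dp_rank * cmp_size + cmp_rank, where cmp_size is
    the combined context/model parallel size.
    """
    if world_size % dp_size != 0:
        raise ValueError("world_size must be divisible by dp_size")
    cmp_size = world_size // dp_size
    # Ranks with the same dp_rank see the same data and split the model/sequence.
    cmp_groups = [
        [dp_rank * cmp_size + i for i in range(cmp_size)]
        for dp_rank in range(dp_size)
    ]
    # Ranks with the same cmp_rank hold the same model slice but different data.
    dp_groups = [
        [dp_rank * cmp_size + i for dp_rank in range(dp_size)]
        for i in range(cmp_size)
    ]
    return dp_groups, cmp_groups


dp_groups, cmp_groups = build_rank_groups(world_size=8, dp_size=2)
print(dp_groups)   # [[0, 4], [1, 5], [2, 6], [3, 7]]
print(cmp_groups)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Every rank appears in exactly one group of each kind, which is why the groups have to exist before initialize() shards the model.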

Sources: areal/api/engine_api.py:33-42, areal/experimental/engine/archon_engine.py:201-235, areal/engine/fsdp_engine.py:219-253, areal/engine/megatron_engine.py:195-230

Engine Initialization

The initialize() method loads the model, applies parallelism strategies, and creates optimizers. It accepts a FinetuneSpec (areal/api/io_struct.py:134-147) that specifies training hyperparameters.

Step	ArchonEngine	FSDPEngine	MegatronEngine
1. Model	`PipelineStage` creation	`AutoModelForCausalLM`	`make_mcore_model`
2. Parallelism	`PipelinedRunner`	`parallelize_model()`	`DistributedDataParallel`
3. Optimizer	`create_optimizer()`	`AnyPrecisionAdamW`	`get_megatron_optimizer`
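The kind of information such a spec carries into initialize() can be sketched as a small dataclass. The class is named `SketchFinetuneSpec` to avoid implying that it mirrors the real FinetuneSpec, and its fields (epoch count, dataset size, batch size) are assumptions for illustration. Derived step counts like these are what a learning-rate scheduler typically needs at construction time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchFinetuneSpec:
    """Hypothetical training-length spec; field names are assumed."""

    total_train_epochs: int
    dataset_size: int
    train_batch_size: int

    @property
    def steps_per_epoch(self) -> int:
        # Floor division: a trailing partial batch is dropped in this sketch.
        return self.dataset_size // self.train_batch_size

    @property
    def total_train_steps(self) -> int:
        return self.total_train_epochs * self.steps_per_epoch


spec = SketchFinetuneSpec(total_train_epochs=3, dataset_size=1000, train_batch_size=64)
print(spec.steps_per_epoch)    # 15
print(spec.total_train_steps)  # 45
```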

Sources: areal/api/engine_api.py:44-57, areal/experimental/engine/archon_engine.py:237-330, areal/engine/fsdp_engine.py:255-353, areal/engine/megatron_engine.py:232-350

Training Methods

Micro-Batch Processing

All engines use a unified micro-batch splitting and processing pipeline. The MicroBatchList structure organizes data for gradient accumulation.

Stateless utilities in areal/engine/core/train_engine.py:

compute_total_loss_weight: Aggregates weights across the DP group using all_reduce (areal/engine/core/train_engine.py:30-65)
aggregate_eval_losses: Reduces losses across DP and PP groups (areal/engine/core/train_engine.py:68-107)
reorder_and_pad_outputs: Post-processing for forward_batch (areal/engine/core/train_engine.py:110-144)
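The two ideas above, splitting a batch into micro-batches and summing loss weights across ranks, can be sketched without any distributed runtime. The greedy token-budget packing and both function names are assumptions for this example and are not the MicroBatchList implementation; the nested sum stands in for an all_reduce(SUM) over the DP group.

```python
def split_into_micro_batches(seq_lens: list[int], max_tokens: int) -> list[list[int]]:
    """Greedily pack sequence indices so each micro-batch stays within max_tokens."""
    micro_batches: list[list[int]] = []
    current: list[int] = []
    current_tokens = 0
    for idx, n in enumerate(seq_lens):
        if n > max_tokens:
            raise ValueError(f"sequence {idx} has {n} tokens, above the budget")
        if current and current_tokens + n > max_tokens:
            # Adding this sequence would exceed the budget, so close the micro-batch.
            micro_batches.append(current)
            current, current_tokens = [], 0
        current.append(idx)
        current_tokens += n
    if current:
        micro_batches.append(current)
    return micro_batches


def total_loss_weight(per_rank_weights: list[list[float]]) -> float:
    """Sum micro-batch weights on each rank, then across ranks.

    The outer sum plays the role of all_reduce(SUM) over the DP group.
    """
    return sum(sum(weights) for weights in per_rank_weights)


micro_batches = split_into_micro_batches([5, 3, 4, 6, 2], max_tokens=8)
print(micro_batches)  # [[0, 1], [2], [3, 4]]

# Two simulated DP ranks, each reporting token counts per micro-batch.
weight = total_loss_weight([[8.0, 4.0, 8.0], [7.0, 5.0]])
print(weight)  # 32.0
```

Dividing each micro-batch loss by the global weight, rather than by a per-micro-batch count, keeps the result independent of how the batch happened to be split.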

Sources: areal/engine/core/train_engine.py:1-145, areal/experimental/engine/archon_runner.py:30-51

Training Batch

The train_batch() method performs a complete training step with gradient accumulation (areal/api/engine_api.py:239-251).

Training controllers such as LMEngine (areal/trainer/sft/lm_engine.py:29-34) and RWEngine (areal/trainer/rw/rw_engine.py:41-77) call this method with algorithm-specific loss functions, such as compute_packed_sft_loss (areal/trainer/sft/lm_engine.py:79-130) or compute_rw_loss (areal/trainer/rw/rw_engine.py:115-146).
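The gradient accumulation inside such a step can be illustrated with a one-parameter model in plain Python. This is a toy sketch under stated assumptions: `train_batch_sketch` and `sq_loss_grad` are hypothetical names, the model is y ≈ w * x, and the caller-supplied gradient function stands in for an algorithm-specific loss. It shows that accumulating micro-batch gradients scaled by a global weight gives the same update as processing the full batch at once.

```python
from typing import Callable

Sample = tuple[float, float]  # (input x, target y)


def sq_loss_grad(w: float, batch: list[Sample]) -> float:
    """Gradient of the summed squared error of y ≈ w * x with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch)


def train_batch_sketch(
    w: float,
    micro_batches: list[list[Sample]],
    grad_fn: Callable[[float, list[Sample]], float],
    lr: float,
) -> float:
    """Accumulate gradients over micro-batches, then take one SGD step."""
    total_weight = float(sum(len(mb) for mb in micro_batches))
    grad = 0.0
    for mb in micro_batches:
        # Scale by the global weight so the split does not change the result.
        grad += grad_fn(w, mb) / total_weight
    return w - lr * grad


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w_accum = train_batch_sketch(0.0, [data[:1], data[1:3], data[3:]], sq_loss_grad, lr=0.01)
w_full = train_batch_sketch(0.0, [data], sq_loss_grad, lr=0.01)
print(abs(w_accum - w_full) < 1e-12)  # True
```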

Sources: areal/api/engine_api.py:239-251, areal/trainer/sft/lm_engine.py:29-34, areal/trainer/rw/rw_engine.py:41-77

Inference Integration

Weight Synchronization

The update_weights() method synchronizes trained weights to inference engines (areal/api/engine_api.py:175-183).

Weight update modes:

Mode	Backend Method	Description
xccl	`build_distributed_weight_update_requests()`	NCCL/XCCL broadcast from training ranks to inference ranks (areal/engine/sglang_remote.py:161-187)
disk	`build_disk_weight_update_requests()`	Training saves a checkpoint to shared storage and the inference engine loads it (areal/engine/vllm_remote.py:129-148)
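The choice between the two modes amounts to a dispatch on the update metadata. The sketch below shows that shape only: `SketchWeightUpdateMeta`, `plan_weight_update`, and the step descriptions are hypothetical, and no real request is built or sent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SketchWeightUpdateMeta:
    """Hypothetical update descriptor; field names are assumed."""

    mode: str                   # "xccl" or "disk"
    path: Optional[str] = None  # used only by the disk mode in this sketch


def plan_weight_update(meta: SketchWeightUpdateMeta) -> list[str]:
    """Return the ordered steps a weight update would take for the given mode."""
    if meta.mode == "xccl":
        return [
            "set up a communication group spanning training and inference ranks",
            "broadcast tensors from training ranks to inference ranks",
        ]
    if meta.mode == "disk":
        if meta.path is None:
            raise ValueError("disk mode needs a shared storage path")
        return [
            f"save checkpoint to {meta.path}",
            f"ask the inference engine to load {meta.path}",
        ]
    raise ValueError(f"unknown weight update mode: {meta.mode!r}")


disk_plan = plan_weight_update(SketchWeightUpdateMeta(mode="disk", path="/shared/ckpt"))
xccl_plan = plan_weight_update(SketchWeightUpdateMeta(mode="xccl"))
```

The disk mode trades latency for simplicity, since it needs only shared storage, while the xccl mode avoids the filesystem round trip at the cost of a collective that spans both sides.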

Diagram: Natural Language to Code Entity Space: Weight Sync Flow

Sources: areal/api/engine_api.py:175-183, areal/engine/sglang_remote.py:161-187, areal/engine/vllm_remote.py:129-183, areal/experimental/engine/archon_weight_sync.py:114-176

Checkpointing

Save and Load

The save() and load() methods manage model and optimizer state persistence (areal/api/engine_api.py:253-264).

Archon Checkpoint Implementation:

save_model_to_hf(): Converts distributed weights to a single HuggingFace-compatible checkpoint (areal/experimental/engine/archon_checkpoint.py:17-54)
Handles DTensor full tensor gathering and CPU offload conversion (areal/experimental/engine/archon_weight_sync.py:95-110)
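The gather-then-save pattern described above can be sketched with plain lists standing in for sharded tensors. Everything here is an assumption made for illustration: the function names are hypothetical, shards are concatenated in rank order, and JSON replaces the real checkpoint format.

```python
import json
import os
import tempfile


def gather_full_state(sharded_state: dict[str, list[list[float]]]) -> dict[str, list[float]]:
    """Concatenate each parameter's per-rank shards in rank order."""
    return {
        name: [value for shard in shards for value in shard]
        for name, shards in sharded_state.items()
    }


def save_state(state: dict[str, list[float]], path: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        json.dump(state, f)


def load_state(path: str) -> dict[str, list[float]]:
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Two simulated ranks, each holding half of every parameter.
sharded = {
    "linear.weight": [[0.1, 0.2], [0.3, 0.4]],
    "linear.bias": [[0.5], [0.6]],
}
full = gather_full_state(sharded)

with tempfile.TemporaryDirectory() as tmp:
    ckpt_path = os.path.join(tmp, "model.json")
    save_state(full, ckpt_path)
    restored = load_state(ckpt_path)

print(restored == full)  # True
```

Writing one unsharded file makes the result loadable without knowing how the training job was parallelized.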

Sources: areal/api/engine_api.py:253-264, areal/experimental/engine/archon_checkpoint.py:17-54, areal/experimental/engine/archon_weight_sync.py:95-110

Advanced Features

Tree Training

Tree training enables training on trajectory trees (multiple branches per node) using specialized attention mechanisms.

Key components:

TreeAttentionMeta: Metadata passed to model forward for tree-based masking (areal/experimental/engine/archon_runner.py:81-87)
SequentialRunner handles tree attention by creating dummy cu_seqlens for model compatibility (areal/experimental/engine/archon_runner.py:88-94)
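Tree-based masking in general restricts each token to attend only to itself and its ancestors, so sibling branches that share a prefix do not see each other. The sketch below builds such a mask from a parent array. It is a generic illustration, not the TreeAttentionMeta layout: the function name, the parent-array encoding, and the form of the dummy cu_seqlens are all assumptions.

```python
def tree_attention_mask(parents: list[int]) -> list[list[bool]]:
    """mask[i][j] is True when token j is token i itself or one of its ancestors.

    parents[i] is the index of token i's parent, or -1 for a root.
    Assumes parents[i] < i, i.e. nodes are listed in topological order.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:  # walk up to the root, marking every ancestor
            mask[i][j] = True
            j = parents[j]
    return mask


# Shared prefix (tokens 0 and 1) with two branches: 2 -> 3, and 4.
parents = [-1, 0, 1, 2, 1]
mask = tree_attention_mask(parents)
print(mask[3])  # [True, True, True, True, False]
print(mask[4])  # [True, True, False, False, True]

# Assumed form of a dummy boundary array: the packed tree is one "sequence".
dummy_cu_seqlens = [0, len(parents)]
```

Token 4 attends to the shared prefix but not to tokens 2 and 3, which belong to the other branch.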

Sources: areal/experimental/engine/archon_runner.py:81-105, areal/models/tree_attn/module.py:102-105

Memory Offloading

The training engines support moving model parameters to CPU during inference phases to save GPU memory. ArchonEngine manages this state during weight synchronization by checking each tensor's device and moving tensors to the appropriate platform device type when needed (areal/experimental/engine/archon_weight_sync.py:108-110).
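The check-then-move pattern can be sketched with a small stand-in for a parameter. `SketchParam` and `ensure_on_device` are hypothetical names, and the device strings are plain labels; no real tensors are moved.

```python
from dataclasses import dataclass


@dataclass
class SketchParam:
    """Hypothetical stand-in for a parameter that records where it lives."""

    name: str
    device: str = "cpu"


def ensure_on_device(params: list[SketchParam], target: str) -> int:
    """Move every parameter not already on target; return how many moved."""
    moved = 0
    for p in params:
        if p.device != target:
            # A real engine would copy the tensor here; the sketch only relabels.
            p.device = target
            moved += 1
    return moved


params = [SketchParam("weight", "cpu"), SketchParam("bias", "cuda")]
onloaded = ensure_on_device(params, "cuda")   # before weight sync: 1 param moved
offloaded = ensure_on_device(params, "cpu")   # during inference: 2 params moved
```

Checking the device first makes the operation idempotent, so it is safe to call whether or not the parameters are currently offloaded.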

Sources: areal/experimental/engine/archon_weight_sync.py:95-110, areal/utils/offload.py:131-132


URL: https://deepwiki.com/inclusionAI/AReaL/3.1-trainengine-api
