Last indexed: 7 May 2026 (2e12c1)

Training Engines

Purpose and Scope

Training Engines are the core computational backends responsible for executing model training in AReaL. They abstract the complexity of distributed training, parallelism strategies, and optimization across different frameworks. This page provides an overview of the training engine architecture, including the available backend implementations and how they integrate into the asynchronous RL training pipeline.

For specific backend implementations, see FSDPEngine and ArchonEngine. For details on the abstract interface, see TrainEngine API. For training orchestration, see Trainer Orchestration.

Training Engine Overview

Training engines encapsulate all model-related operations during training, including forward/backward passes, optimization, checkpointing, and distributed communication. They provide a unified interface via the TrainEngine abstract class (areal/api/engine_api.py32-33), with three primary backend implementations:

FSDPEngine: PyTorch FSDP2-based backend with N-D parallelism (DP+TP+SP) and native support for Vision-Language Models areal/engine/fsdp_engine.py218-240
MegatronEngine: NVIDIA Megatron-Core backend with full parallelism support (DP+TP+PP+CP+EP) and FP8 training capabilities areal/engine/megatron_engine.py168-186
ArchonEngine: Custom torch-native backend with advanced features like tree training and custom pipeline schedules areal/experimental/engine/archon_engine.py147-165

All training engines implement the same abstract interface, enabling seamless switching between backends via configuration. The choice of backend depends on model architecture, cluster configuration, and specific feature requirements.
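
As a rough illustration of configuration-driven backend switching, the sketch below registers engine classes under a name and builds one from a config value. Every name in it (EngineConfig, SketchTrainEngine, register, build_engine) is hypothetical and chosen for illustration; it is not AReaL's actual TrainEngine API or configuration schema.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EngineConfig:
    # Hypothetical config: only the backend name matters for this sketch.
    backend: str = "fsdp"


class SketchTrainEngine(ABC):
    """Hypothetical stand-in for an abstract training-engine interface."""

    def __init__(self, config: EngineConfig) -> None:
        self.config = config

    @abstractmethod
    def train_batch(self, batch: List[float]) -> float:
        """Run one training step and return a scalar loss."""


# Maps a backend name to a constructor taking the config.
_REGISTRY: Dict[str, Callable[[EngineConfig], SketchTrainEngine]] = {}


def register(name: str):
    """Class decorator that records an engine class under `name`."""
    def decorator(cls):
        _REGISTRY[name] = cls
        return cls
    return decorator


@register("fsdp")
class SketchFSDPEngine(SketchTrainEngine):
    def train_batch(self, batch: List[float]) -> float:
        # Placeholder "loss": the mean of the batch values.
        return sum(batch) / len(batch)


@register("megatron")
class SketchMegatronEngine(SketchTrainEngine):
    def train_batch(self, batch: List[float]) -> float:
        return sum(batch) / len(batch)


def build_engine(config: EngineConfig) -> SketchTrainEngine:
    """Instantiate the engine named in the config, or fail loudly."""
    try:
        return _REGISTRY[config.backend](config)
    except KeyError:
        raise ValueError(f"unknown backend: {config.backend!r}") from None
```

Because callers only depend on the abstract train_batch method, changing EngineConfig(backend="fsdp") to EngineConfig(backend="megatron") swaps the implementation without touching the calling code.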

Architecture Diagram: Training Engine Hierarchy

Sources: areal/api/engine_api.py32-240 areal/engine/fsdp_engine.py218-240 areal/engine/megatron_engine.py168-186 areal/experimental/engine/archon_engine.py147-196

Integration with Training Pipeline

Training engines integrate tightly with the broader training system, interfacing with trainers (e.g., PPOTrainer), rollout coordinators, and inference engines for asynchronous RL training.

Sequence Diagram: Engine in Training Pipeline

Sources: areal/engine/fsdp_engine.py218-250 areal/engine/megatron_engine.py168-186 areal/infra/dist_rollout.py1-20 areal/api/io_struct.py167-200

Configuration System

Training engines are configured via TrainEngineConfig and backend-specific configuration classes. The configuration determines parallelism strategy, optimization settings, memory management, and backend-specific features.

Class Diagram: Configuration Structure

Sources: areal/api/cli_args.py100-140 areal/api/cli_args.py441-550 areal/api/cli_args.py575-650

Training Batch Processing

All training engines follow a common microbatching pipeline for memory efficiency and throughput. The process involves packing sequences, splitting them into micro-batches, and executing forward/backward passes.

Flowchart: Training Batch Processing

Sources: areal/utils/data.py122-125 areal/engine/fsdp_engine.py116-127 areal/experimental/engine/archon_engine.py94-105
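
One detail worth spelling out for the forward/backward step is how per-micro-batch gradients combine. The sketch below uses a toy one-parameter model in pure Python and assumes a mean-reduced loss; under that assumption, scaling each micro-batch by its share of the batch reproduces the full-batch gradient, whereas a plain average does not when micro-batches differ in size. The function names are hypothetical, and the source does not state which normalization AReaL's engines apply.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, float]  # (input x, target y) for a toy model y_hat = w * x


def grad_sum(w: float, samples: Sequence[Sample]) -> float:
    """Sum over samples of d/dw (w*x - y)^2 = 2 * (w*x - y) * x."""
    return sum(2.0 * (w * x - y) * x for x, y in samples)


def full_batch_grad(w: float, batch: Sequence[Sample]) -> float:
    """Gradient of the mean squared error over the whole batch."""
    return grad_sum(w, batch) / len(batch)


def accumulated_grad(w: float, micro_batches: List[Sequence[Sample]]) -> float:
    """Accumulate micro-batch gradients, each weighted by its share of samples."""
    total = sum(len(mb) for mb in micro_batches)
    grad = 0.0
    for mb in micro_batches:
        mb_mean_grad = grad_sum(w, mb) / len(mb)  # backward on the micro-batch mean loss
        grad += mb_mean_grad * (len(mb) / total)  # rescale so the sum matches the full batch
    return grad


def naive_grad(w: float, micro_batches: List[Sequence[Sample]]) -> float:
    """Unweighted average of micro-batch gradients (biased for unequal sizes)."""
    means = [grad_sum(w, mb) / len(mb) for mb in micro_batches]
    return sum(means) / len(means)


batch = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 9.0), (5.0, 9.0)]
micro_batches = [batch[:2], batch[2:]]  # deliberately unequal sizes: 2 and 3
```

With variable-length sequences the same argument applies per token rather than per sample, which is one reason to balance token counts across micro-batches.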

MicroBatch System

The microbatch system enables efficient processing of variable-length sequences on memory-constrained hardware. It uses strategies such as token-based splitting to balance token counts across micro-batches.

Key Components:

MicroBatchSpec: Configuration for microbatch splitting, including token limits (max_tokens_per_mb) and granularity. Supports ffd (First Fit Decreasing) and kk (Karmarkar-Karp) packing algorithms areal/api/cli_args.py100-140
MicroBatchList: Container for split microbatches, managing padding and reordering areal/utils/data.py118
MicroBatchItem: Individual microbatch with original and padded tensors areal/utils/data.py117

Sources: areal/api/cli_args.py100-140 areal/utils/data.py116-127
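
To make the ffd option concrete, here is a minimal, generic First Fit Decreasing packer that groups sequences under a token budget. It is a textbook sketch rather than AReaL's implementation: ffd_pack is a hypothetical name, and only max_tokens_per_mb comes from the text above. Rejecting a sequence that exceeds the budget is a choice made for this sketch.

```python
from typing import List


def ffd_pack(seq_lens: List[int], max_tokens_per_mb: int) -> List[List[int]]:
    """Group sequence indices into micro-batches with First Fit Decreasing.

    Returns a list of micro-batches, each a list of indices into seq_lens,
    such that the token count of every micro-batch is <= max_tokens_per_mb.
    """
    # Visit sequences from longest to shortest.
    order = sorted(range(len(seq_lens)), key=lambda i: seq_lens[i], reverse=True)
    bins: List[List[int]] = []
    loads: List[int] = []  # loads[b] is the token count already in bins[b]
    for i in order:
        n = seq_lens[i]
        if n > max_tokens_per_mb:
            raise ValueError(f"sequence {i} has {n} tokens, over the limit")
        for b, load in enumerate(loads):
            if load + n <= max_tokens_per_mb:
                # First existing micro-batch with enough room wins.
                bins[b].append(i)
                loads[b] += n
                break
        else:
            # No existing micro-batch fits: open a new one.
            bins.append([i])
            loads.append(n)
    return bins


packed = ffd_pack([700, 300, 512, 200, 100, 900], max_tokens_per_mb=1024)
print(packed)  # [[5, 4], [0, 1], [2, 3]] with token loads 1000, 1000, 712
```

Because the result holds indices rather than the sequences themselves, the original order can be restored after the forward pass, which is the kind of reordering a container such as MicroBatchList has to track.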

Weight Synchronization

Training engines support multiple weight synchronization modes for updating inference engines during RL training, defined in WeightUpdateMeta (areal/api/io_struct.py167-200):

XCCL: Low-latency synchronization via NCCL/XCCL process groups, supported by FSDPEngine and ArchonEngine areal/experimental/engine/archon_weight_sync.py64-67
Disk: Robust synchronization via shared filesystem checkpoints areal/api/io_struct.py202-213
LoRA: Support for lightweight adapter updates with version tracking using get_versioned_lora_name areal/api/io_struct.py161-163

For details on weight synchronization mechanisms, see Weight Synchronization.
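
The disk mode can be pictured as a versioned hand-off through a shared directory. The sketch below is a simplified illustration using JSON files and hypothetical helper names (publish_weights, latest_version, load_if_newer); it does not reflect the actual layout or fields of WeightUpdateMeta. It assumes the trainer and the inference side see the same filesystem and that a directory rename within it is atomic.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Dict, Optional, Tuple


def publish_weights(root: Path, version: int, weights: Dict[str, float]) -> Path:
    """Trainer side: write weights for `version`, visible only once complete."""
    final_dir = root / f"v{version}"
    # Write into a hidden temporary directory first so readers never see a
    # half-written checkpoint.
    tmp_dir = Path(tempfile.mkdtemp(dir=root, prefix=".tmp-"))
    (tmp_dir / "weights.json").write_text(json.dumps(weights))
    os.replace(tmp_dir, final_dir)  # rename into place on the same filesystem
    return final_dir


def latest_version(root: Path) -> Optional[int]:
    """Return the highest published version number, or None if there is none."""
    versions = [
        int(p.name[1:])
        for p in root.iterdir()
        if p.is_dir() and p.name.startswith("v") and p.name[1:].isdigit()
    ]
    return max(versions, default=None)


def load_if_newer(
    root: Path, current_version: int
) -> Optional[Tuple[int, Dict[str, float]]]:
    """Inference side: load weights only if a newer version has been published."""
    latest = latest_version(root)
    if latest is None or latest <= current_version:
        return None
    weights = json.loads((root / f"v{latest}" / "weights.json").read_text())
    return latest, weights
```

Tracking an explicit version number lets the inference side skip redundant reloads and tells the rollout side which policy version produced each sample.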

Memory Management

Training engines provide several memory management features to prevent OOM errors (docs/en/best_practices/handling_oom.md1-181):

Gradient Checkpointing: Recomputes activations to save memory, configured in TrainEngineConfig areal/api/cli_args.py441-550
CPU Offloading: Moves parameters or optimizer states to CPU via CPUOffloadPolicy areal/engine/fsdp_engine.py31
Per-Layer Optimizer: Reduces peak memory by updating layers sequentially with PerLayerOptimWrapper, which streams optimizer states to the device layer by layer areal/engine/fsdp_utils/optimizer.py85
Torch Memory Saver (TMS): Dynamic offloading during training phases areal/utils/offload.py131-132
Memory-Efficient Loading: Loads full weights on rank 0 and broadcasts to shards to reduce peak memory during initialization docs/en/best_practices/handling_oom.md203-223

For detailed memory optimization strategies, see Memory Management.
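
The per-layer optimizer idea can be simulated without a GPU. The toy class below is a pure-Python sketch with hypothetical names, not PerLayerOptimWrapper itself. It keeps momentum buffers in a host-side dict, brings in one layer's buffer at a time, applies an SGD-with-momentum update, and records the largest amount of optimizer state resident at once.

```python
from typing import Dict, List


class PerLayerMomentumSGD:
    """Toy SGD with momentum whose state is 'on device' for one layer at a time."""

    def __init__(
        self, params: Dict[str, List[float]], lr: float = 0.1, momentum: float = 0.9
    ) -> None:
        self.params = params
        self.lr = lr
        self.momentum = momentum
        # Optimizer state parked on the host between updates (simulated by a dict).
        self.host_state = {name: [0.0] * len(p) for name, p in params.items()}
        # Largest number of state elements resident "on device" at any moment.
        self.peak_device_state = 0

    def step(self, grads: Dict[str, List[float]]) -> None:
        for name, param in self.params.items():
            buf = self.host_state.pop(name)  # "host -> device" for this layer only
            self.peak_device_state = max(self.peak_device_state, len(buf))
            for i, g in enumerate(grads[name]):
                buf[i] = self.momentum * buf[i] + g  # momentum update
                param[i] -= self.lr * buf[i]         # parameter update
            self.host_state[name] = buf  # "device -> host" before the next layer


params = {"layer0": [1.0, 1.0], "layer1": [1.0, 1.0, 1.0]}
opt = PerLayerMomentumSGD(params)
opt.step({"layer0": [1.0, 1.0], "layer1": [1.0, 1.0, 1.0]})
```

In this toy run the peak resident state is 3 elements (the largest layer) rather than 5 (all layers together), which is the effect the per-layer approach aims for. A real implementation would also have to overlap transfers with compute to hide the added latency.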

TrainEngine API — Abstract interface specification
FSDPEngine — PyTorch FSDP2 implementation details
ArchonEngine — Custom torch-native implementation
Microbatching Pipeline — Data flow and microbatching logic
Weight Synchronization — Weight update mechanisms
Checkpointing and Recovery — Model saving and recovery system
Memory Management — Memory optimization techniques

URL: https://deepwiki.com/inclusionAI/AReaL/3-training-engines