
URL: https://deepwiki.com/inclusionAI/AReaL/3-training-engines

Training Engines | inclusionAI/AReaL | DeepWiki


Last indexed: 7 May 2026 (2e12c1)

Training Engines

Purpose and Scope

Training Engines are the core computational backends responsible for executing model training in AReaL. They abstract the complexity of distributed training, parallelism strategies, and optimization across different frameworks. This page provides an overview of the training engine architecture, including the available backend implementations and how they integrate into the asynchronous RL training pipeline.

For specific backend implementations, see FSDPEngine and ArchonEngine. For details on the abstract interface, see TrainEngine API. For training orchestration, see Trainer Orchestration.

Training Engine Overview

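Training engines encapsulate all model-related operations during training, including forward/backward passes, optimization, checkpointing, and distributed communication. They expose a unified interface through the TrainEngine abstract class (areal/api/engine_api.py:32-33). The backend implementations cited on this page are the FSDP engine (areal/engine/fsdp_engine.py), the Megatron engine (areal/engine/megatron_engine.py), and the experimental Archon engine (areal/experimental/engine/archon_engine.py).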

All training engines implement the same abstract interface, enabling seamless switching between backends via configuration. The choice of backend depends on model architecture, cluster configuration, and specific feature requirements.
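The idea of switching backends through configuration can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names SketchTrainEngine, SketchBackendA, SketchBackendB, ENGINE_REGISTRY, and build_engine are hypothetical and do not reflect the actual TrainEngine methods or how AReaL selects a backend.

```python
from abc import ABC, abstractmethod


class SketchTrainEngine(ABC):
    """Hypothetical stand-in for an abstract training-engine interface."""

    @abstractmethod
    def train_batch(self, batch: list[float]) -> dict[str, float]:
        """Run one training step and return scalar statistics."""


class SketchBackendA(SketchTrainEngine):
    """Toy backend: reports the mean of the batch as its 'loss'."""

    def train_batch(self, batch: list[float]) -> dict[str, float]:
        return {"loss": sum(batch) / len(batch)}


class SketchBackendB(SketchTrainEngine):
    """Toy backend: reports the max of the batch as its 'loss'."""

    def train_batch(self, batch: list[float]) -> dict[str, float]:
        return {"loss": max(batch)}


# Hypothetical registry mapping a config string to a backend class.
ENGINE_REGISTRY: dict[str, type[SketchTrainEngine]] = {
    "backend_a": SketchBackendA,
    "backend_b": SketchBackendB,
}


def build_engine(backend: str) -> SketchTrainEngine:
    """Instantiate the backend named in the configuration."""
    try:
        return ENGINE_REGISTRY[backend]()
    except KeyError:
        raise ValueError(f"unknown backend {backend!r}") from None


# The calling code depends only on the abstract interface.
engine = build_engine("backend_a")
stats = engine.train_batch([1.0, 3.0])
```

Because the caller only touches the abstract method, changing the string passed to build_engine is the only change needed to swap the toy backends.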

Architecture Diagram: Training Engine Hierarchy


Sources: areal/api/engine_api.py:32-240, areal/engine/fsdp_engine.py:218-240, areal/engine/megatron_engine.py:168-186, areal/experimental/engine/archon_engine.py:147-196

Integration with Training Pipeline

Training engines integrate tightly with the broader training system, interfacing with trainers (e.g., PPOTrainer), rollout coordinators, and inference engines for asynchronous RL training.

Sequence Diagram: Engine in Training Pipeline


Sources: areal/engine/fsdp_engine.py:218-250, areal/engine/megatron_engine.py:168-186, areal/infra/dist_rollout.py:1-20, areal/api/io_struct.py:167-200

Configuration System

Training engines are configured via TrainEngineConfig and backend-specific configuration classes. The configuration determines parallelism strategy, optimization settings, memory management, and backend-specific features.

Class Diagram: Configuration Structure


Sources: areal/api/cli_args.py:100-140, areal/api/cli_args.py:441-550, areal/api/cli_args.py:575-650
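A nested configuration of this kind might be laid out roughly as follows. This is a sketch under stated assumptions, not the real TrainEngineConfig: the class names SketchEngineConfig and SketchMicroBatchSpec, the fields backend, learning_rate, and gradient_checkpointing, and all default values are hypothetical. Only the field name max_tokens_per_mb comes from this page.

```python
from dataclasses import dataclass, field


@dataclass
class SketchMicroBatchSpec:
    """Hypothetical microbatch settings."""

    # Field name taken from this page; the default value is made up.
    max_tokens_per_mb: int = 4096


@dataclass
class SketchEngineConfig:
    """Hypothetical engine configuration grouping the concerns named above."""

    backend: str = "fsdp"                 # hypothetical backend selector
    learning_rate: float = 1e-5           # hypothetical optimization setting
    gradient_checkpointing: bool = False  # hypothetical memory option
    mb_spec: SketchMicroBatchSpec = field(default_factory=SketchMicroBatchSpec)

    def validate(self) -> None:
        """Reject values that could never produce a usable training setup."""
        if self.learning_rate <= 0:
            raise ValueError("learning_rate must be positive")
        if self.mb_spec.max_tokens_per_mb <= 0:
            raise ValueError("max_tokens_per_mb must be positive")


# Override only the settings that differ from the defaults.
cfg = SketchEngineConfig(mb_spec=SketchMicroBatchSpec(max_tokens_per_mb=8192))
cfg.validate()
```

Keeping the microbatch settings in their own dataclass lets the same specification be reused wherever a batch needs to be split.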

Training Batch Processing

All training engines follow a common microbatching pipeline for memory efficiency and throughput. The process involves packing sequences, splitting them into micro-batches, and executing forward/backward passes.

Flowchart: Training Batch Processing


Sources: areal/utils/data.py:122-125, areal/engine/fsdp_engine.py:116-127, areal/experimental/engine/archon_engine.py:94-105
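The packing and accumulation steps can be sketched in plain Python. This is an illustration of the general technique, not AReaL's code: pack_sequences, unpack_sequences, and accumulate_loss are hypothetical helpers, and the per-token "loss" is a stand-in for a real forward pass. Packing concatenates variable-length sequences into one flat buffer with cumulative offsets, so no padding tokens are needed. Dividing each micro-batch's summed loss by the token count of the whole batch makes the accumulated value equal to the full-batch token mean.

```python
from itertools import accumulate
from typing import Callable


def pack_sequences(seqs: list[list[int]]) -> tuple[list[int], list[int]]:
    """Concatenate sequences and return (flat_tokens, cumulative_offsets)."""
    flat = [tok for seq in seqs for tok in seq]
    cu_seqlens = [0, *accumulate(len(seq) for seq in seqs)]
    return flat, cu_seqlens


def unpack_sequences(flat: list[int], cu_seqlens: list[int]) -> list[list[int]]:
    """Invert pack_sequences using the cumulative offsets."""
    return [flat[start:end] for start, end in zip(cu_seqlens, cu_seqlens[1:])]


def accumulate_loss(
    micro_batches: list[list[list[int]]],
    per_token_loss: Callable[[int], float],
) -> float:
    """Sum micro-batch losses, each normalized by the total token count."""
    total_tokens = sum(len(seq) for mb in micro_batches for seq in mb)
    total = 0.0
    for mb in micro_batches:
        flat, _ = pack_sequences(mb)
        # In a real engine this is where forward/backward would run.
        total += sum(per_token_loss(tok) for tok in flat) / total_tokens
    return total


flat, cu_seqlens = pack_sequences([[1, 2, 3], [4], [5, 6]])
loss = accumulate_loss([[[1, 2, 3]], [[4], [5, 6]]], per_token_loss=float)
```

In this toy run the two micro-batches hold three tokens each, and the accumulated loss matches the mean over all six tokens.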

MicroBatch System

The microbatch system enables efficient processing of variable-length sequences on memory-constrained hardware. It uses strategies such as token-based splitting to balance token counts across microbatches.

Key Components:

  • MicroBatchSpec: Configuration for microbatch splitting, including the per-microbatch token limit (max_tokens_per_mb) and the splitting granularity. It supports the ffd (First Fit Decreasing) and kk (Karmarkar-Karp) packing algorithms (areal/api/cli_args.py:100-140).
  • MicroBatchList: Container for the split microbatches that manages padding and reordering (areal/utils/data.py:118).
  • MicroBatchItem: An individual microbatch holding both the original and the padded tensors (areal/utils/data.py:117).

Sources: areal/api/cli_args.py:100-140, areal/utils/data.py:116-127
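Token-based splitting with First Fit Decreasing can be sketched as below. This is the textbook FFD bin-packing heuristic, shown under the assumption that each microbatch is a bin whose capacity is max_tokens_per_mb. The function ffd_split is a hypothetical helper and is not taken from AReaL's implementation.

```python
def ffd_split(seq_lens: list[int], max_tokens_per_mb: int) -> list[list[int]]:
    """Group sequence indices into microbatches using First Fit Decreasing.

    Sequences are visited from longest to shortest, and each one is placed in
    the first microbatch that still has room. A new microbatch is opened only
    when none fits.
    """
    order = sorted(range(len(seq_lens)), key=lambda i: seq_lens[i], reverse=True)
    bins: list[list[int]] = []  # sequence indices per microbatch
    loads: list[int] = []       # token count per microbatch
    for idx in order:
        n_tokens = seq_lens[idx]
        if n_tokens > max_tokens_per_mb:
            raise ValueError(
                f"sequence {idx} has {n_tokens} tokens, over the limit of "
                f"{max_tokens_per_mb}"
            )
        for b, load in enumerate(loads):
            if load + n_tokens <= max_tokens_per_mb:
                bins[b].append(idx)
                loads[b] += n_tokens
                break
        else:
            # No existing microbatch has room, so open a new one.
            bins.append([idx])
            loads.append(n_tokens)
    return bins


seq_lens = [5, 3, 8, 2, 7, 1]
groups = ffd_split(seq_lens, max_tokens_per_mb=10)
token_counts = [sum(seq_lens[i] for i in group) for group in groups]
```

For these six sequences and a limit of 10 tokens, the sketch produces three microbatches holding 10, 10, and 6 tokens. The returned indices refer to the original order, which is what allows results to be reordered back after processing.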

Weight Synchronization

Training engines support multiple weight synchronization modes for updating inference engines during RL training. The modes are defined in WeightUpdateMeta (areal/api/io_struct.py:167-200).

For details on weight synchronization mechanisms, see Weight Synchronization.

Memory Management

Training engines provide several memory management features that help prevent out-of-memory (OOM) errors; these are described in docs/en/best_practices/handling_oom.md:1-181.

For detailed memory optimization strategies, see Memory Management.
