ArchonEngine

Purpose and Scope

ArchonEngine is a custom torch-native training backend that implements the TrainEngine interface (areal/experimental/engine/archon_engine.py147-148). It provides multi-dimensional parallelism (DP, TP, PP, CP, EP) using PyTorch's native distributed primitives (FSDP2, DTensor, DeviceMesh), with no dependency on Megatron-Core. The engine is optimized for Mixture-of-Experts (MoE) models and supports advanced pipeline schedules and FP8 training.

For other training backends, see FSDPEngine (page 3.2) and MegatronEngine (page 3.3).

Supported Parallelism:

Data Parallel: FSDP2 sharding across data-parallel groups areal/experimental/engine/archon_engine.py318-320
Tensor Parallel: DTensor with sharding strategies applied via model-specific parallelization functions areal/experimental/engine/archon_engine.py309-311
Pipeline Parallel: 1F1B, Interleaved 1F1B, DualPipeV, and ZBV (Zero-Bubble) schedules areal/experimental/engine/archon_runner.py148-154 areal/experimental/engine/archon_engine.py18-22
Context Parallel: Ulysses sequence parallelism for long-context training, including input slicing and output gathering areal/experimental/engine/archon_engine.py79-82
Expert Parallel: Support for MoE models with expert sharding and specialized weight conversion areal/experimental/models/archon/qwen3_5/model/state_dict_adapter.py29-30 areal/experimental/models/archon/moe_weight_converter.py12-15
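
As a rough illustration of how the parallel degrees listed above constrain one another, the sketch below checks that a set of degrees tiles a given world size. The class ParallelDimsSketch and its methods are hypothetical and are not the ArchonParallelDims API. Treating EP as a subdivision of the DP x CP ranks, rather than as an extra mesh axis, is an assumption made only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelDimsSketch:
    """Hypothetical holder for parallel degrees (not the ArchonParallelDims API)."""

    dp: int = 1
    tp: int = 1
    pp: int = 1
    cp: int = 1
    ep: int = 1

    def validate(self, world_size: int) -> None:
        for name in ("dp", "tp", "pp", "cp", "ep"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be >= 1")
        # The dense axes have to cover every rank exactly once.
        dense = self.dp * self.tp * self.pp * self.cp
        if dense != world_size:
            raise ValueError(
                f"dp*tp*pp*cp = {dense}, but world_size = {world_size}"
            )
        # Assumption for this sketch: experts are sharded over the dp*cp ranks,
        # so ep must divide that product.
        if (self.dp * self.cp) % self.ep != 0:
            raise ValueError("ep must divide dp * cp in this sketch")

    def mesh_shape(self) -> dict:
        """Return only the axes that are actually parallelized (degree > 1)."""
        axes = {"pp": self.pp, "dp": self.dp, "cp": self.cp, "tp": self.tp}
        return {name: size for name, size in axes.items() if size > 1}


if __name__ == "__main__":
    dims = ParallelDimsSketch(dp=2, tp=2, pp=2, ep=2)
    dims.validate(world_size=8)
    print(dims.mesh_shape())
```

Checking the product up front gives a readable error at startup instead of a failure inside device mesh construction.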

Key Components:

ArchonEngine: Main engine implementing the TrainEngine API areal/experimental/engine/archon_engine.py147
ForwardBackwardRunner: Abstraction for sequential vs. pipelined execution areal/experimental/engine/archon_runner.py30
ModelSpec: Registry for pluggable model architecture support areal/experimental/models/archon/__init__.py71-75
ArchonParallelDims: Management of device meshes and parallel dimensions areal/experimental/models/archon/__init__.py69
WeightSyncState: State container for NCCL/XCCL weight synchronization areal/experimental/engine/archon_weight_sync.py30-46

Sources: areal/experimental/engine/archon_engine.py147-200 areal/experimental/engine/archon_runner.py30-53 areal/experimental/engine/archon_weight_sync.py30-46 areal/experimental/models/archon/qwen3_5/model/state_dict_adapter.py21-32

Architecture Overview

Component Architecture

Diagram: ArchonEngine Core Entities

Diagram: Model Implementation and Registry Mapping

Sources: areal/experimental/engine/archon_engine.py157-173 areal/experimental/engine/archon_runner.py56-124 areal/experimental/models/archon/qwen3/model/args.py18-54 areal/experimental/models/archon/qwen3_5/model/state_dict_adapter.py21-136

ForwardBackwardRunner

The ForwardBackwardRunner (areal/experimental/engine/archon_runner.py30) handles the micro-batch execution logic. It abstracts the differences between standard sequential execution and pipeline-parallel schedules.

SequentialRunner

Used when pipeline parallelism is disabled, i.e. pp=1 (areal/experimental/engine/archon_runner.py56). It iterates through the micro-batches, running a forward pass and an optional backward pass for each. It also includes specialized support for TreeAttentionMeta when tree training is active (areal/experimental/engine/archon_runner.py81-107).
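
The loop described above can be sketched in a few lines. This is a minimal illustration, not the SequentialRunner code: the function run_sequential and its callback parameters are hypothetical, and plain Python values stand in for tensors.

```python
from typing import Callable, Iterable, List, Optional, TypeVar

Batch = TypeVar("Batch")
Out = TypeVar("Out")


def run_sequential(
    micro_batches: Iterable[Batch],
    forward_fn: Callable[[Batch], Out],
    backward_fn: Optional[Callable[[Out], None]] = None,
) -> List[Out]:
    """Run forward (and, if given, backward) on each micro-batch in order."""
    results: List[Out] = []
    for micro_batch in micro_batches:
        output = forward_fn(micro_batch)
        # The backward pass is optional: evaluation and forward-only calls
        # pass backward_fn=None and skip it.
        if backward_fn is not None:
            backward_fn(output)
        results.append(output)
    return results


if __name__ == "__main__":
    seen = []
    outs = run_sequential([1, 2, 3], lambda x: x * 2, seen.append)
    print(outs, seen)
```

Passing the backward step as an optional callback lets one loop serve both training and forward-only calls.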

PipelinedRunner

Used when pp > 1 (areal/experimental/engine/archon_runner.py124). It uses torch.distributed.pipelining to execute the schedules.

Schedules: Supports 1F1B, Interleaved 1F1B, and Zero-Bubble variants like ScheduleDualPipeV and ScheduleZBVZeroBubble areal/experimental/engine/archon_runner.py148-154
Memory Optimization: _patch_skip_output_merge applies a "skip output merge" patch that avoids calling torch.cat on the outputs of all micro-batches, which lowers peak memory during the forward pass areal/experimental/engine/archon_runner.py162-175
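
For readers unfamiliar with the schedules named above, the sketch below generates the textbook 1F1B ordering for a single pipeline stage: a warmup of forward passes, a steady state alternating one forward with one backward, and a cooldown of the remaining backward passes. It is an illustration of the general technique only. The function names are hypothetical, and the code is not taken from torch.distributed.pipelining or archon_runner.py.

```python
from typing import List, Tuple

Step = Tuple[str, int]  # ("F" or "B", micro-batch index)


def one_f_one_b_order(stage: int, num_stages: int, num_microbatches: int) -> List[Step]:
    """Return the per-stage step order of a non-interleaved 1F1B schedule."""
    if not 0 <= stage < num_stages:
        raise ValueError("stage must be in [0, num_stages)")
    # Earlier stages run more forwards before their first backward arrives.
    warmup = min(num_stages - stage - 1, num_microbatches)
    steps: List[Step] = []
    fwd = bwd = 0
    for _ in range(warmup):
        steps.append(("F", fwd))
        fwd += 1
    # Steady state: one forward, then one backward.
    for _ in range(num_microbatches - warmup):
        steps.append(("F", fwd))
        fwd += 1
        steps.append(("B", bwd))
        bwd += 1
    # Cooldown: drain the backward passes still outstanding.
    for _ in range(warmup):
        steps.append(("B", bwd))
        bwd += 1
    return steps


def peak_in_flight(steps: List[Step]) -> int:
    """Largest number of micro-batches whose activations are held at once."""
    live = peak = 0
    for kind, _ in steps:
        live += 1 if kind == "F" else -1
        peak = max(peak, live)
    return peak


if __name__ == "__main__":
    for s in range(4):
        order = one_f_one_b_order(s, num_stages=4, num_microbatches=8)
        print(s, peak_in_flight(order), order[:6])
```

In this ordering the number of activations a stage holds is bounded by its distance from the last stage rather than by the number of micro-batches.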

Sources: areal/experimental/engine/archon_runner.py30-175 areal/experimental/engine/archon_engine.py18-22

Weight Synchronization

ArchonEngine synchronizes weights between the training and inference engines through the archon_weight_sync.py module.

XCCL Synchronization

update_weights_from_distributed (areal/experimental/engine/archon_weight_sync.py114) performs a live broadcast of weights over NCCL/XCCL.

Initialization: init_weight_update_group sets up a dedicated TCP store and process group for the transfer areal/experimental/engine/archon_weight_sync.py49-92
Buffering: Weights are collected into buckets (defined by weight_chunked_mem_mb) to optimize network throughput areal/experimental/engine/archon_weight_sync.py131-163
DTensor Handling: _get_full_tensor (areal/experimental/engine/archon_weight_sync.py95) automatically handles DTensor by calling full_tensor() to gather sharded weights before broadcasting areal/experimental/engine/archon_weight_sync.py98-106
Coordination: The training engine pauses inference generation via pause_generation(), performs the broadcast using _update_bucket_weights(), and then resumes generation areal/experimental/engine/archon_weight_sync.py127-172
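
The buffering step above amounts to greedy bucketing under a size cap. The sketch below shows one way to do that. The function bucket_by_size is hypothetical, parameter sizes are passed in as plain byte counts instead of tensors, and the rule that an oversized parameter gets a bucket of its own is an assumption rather than documented behavior.

```python
from typing import Iterable, Iterator, List, Tuple

MB = 1024 * 1024


def bucket_by_size(
    named_sizes: Iterable[Tuple[str, int]], chunk_mb: int
) -> Iterator[List[str]]:
    """Greedily group parameter names so each bucket stays under chunk_mb."""
    limit = chunk_mb * MB
    bucket: List[str] = []
    used = 0
    for name, nbytes in named_sizes:
        # Close the current bucket before it would exceed the cap.
        if bucket and used + nbytes > limit:
            yield bucket
            bucket, used = [], 0
        # Assumption: a parameter larger than the cap is sent on its own.
        bucket.append(name)
        used += nbytes
    if bucket:
        yield bucket


if __name__ == "__main__":
    params = [("a", 1 * MB), ("b", 2 * MB), ("c", 3 * MB), ("d", 10 * MB), ("e", 1 * MB)]
    for names in bucket_by_size(params, chunk_mb=4):
        print(names)
```

Each yielded bucket would correspond to one broadcast round, so the cap trades the number of collective calls against the size of the staging buffer.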

Sources: areal/experimental/engine/archon_weight_sync.py49-210

Checkpointing and State Management

ArchonEngine utilizes PyTorch Distributed Checkpoint (DCP) for efficient sharded saving and loading.

DCPState

The DCPState class (areal/experimental/engine/archon_checkpoint.py86) wraps model parts and optimizers for DCP operations.

PP Support: Handles non-strict loading in Pipeline Parallel mode since each stage only contains a subset of model keys areal/experimental/engine/archon_checkpoint.py115-116
Optimizer Flattening: Uses flatten_optimizer_state_dict=True to avoid parameter group index collisions across pipeline stages by using unique FQNs as keys areal/experimental/engine/archon_checkpoint.py90-92
CPU Offload: Ensures tensors are moved to CPU during checkpointing to prevent OOM on the training devices areal/experimental/engine/archon_checkpoint.py122
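
The index collision mentioned under Optimizer Flattening can be reproduced with plain dictionaries. In the sketch below, two pipeline stages each number their parameters from 0, so merging index-keyed optimizer state overwrites entries, while keys built from fully qualified names (FQNs) stay distinct. The helper names, the FQNs, and the exact key layout are illustrative assumptions and are not taken from archon_checkpoint.py.

```python
from typing import Dict, List, Tuple

IndexedState = Dict[int, Dict[str, float]]


def merge_index_keyed(stages: List[IndexedState]) -> Tuple[IndexedState, List[int]]:
    """Merge per-stage states keyed by parameter index and report overwrites."""
    merged: IndexedState = {}
    collisions: List[int] = []
    for state in stages:
        for idx, stats in state.items():
            if idx in merged:
                collisions.append(idx)
            merged[idx] = stats
    return merged, collisions


def flatten_by_fqn(fqns: List[str], state: IndexedState) -> Dict[str, float]:
    """Re-key {index: {stat: value}} as {"state.<fqn>.<stat>": value}."""
    flat: Dict[str, float] = {}
    for idx, stats in state.items():
        for stat, value in stats.items():
            flat[f"state.{fqns[idx]}.{stat}"] = value
    return flat


if __name__ == "__main__":
    # Each stage's optimizer numbers its own parameters starting at 0.
    stage0: IndexedState = {0: {"exp_avg": 1.0}}
    stage1: IndexedState = {0: {"exp_avg": 2.0}}
    print(merge_index_keyed([stage0, stage1]))

    flat = {}
    flat.update(flatten_by_fqn(["layers.0.weight"], stage0))  # hypothetical FQN
    flat.update(flatten_by_fqn(["layers.1.weight"], stage1))  # hypothetical FQN
    print(flat)
```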

State Dict Adapters

The engine uses BaseStateDictAdapter (areal/experimental/models/archon/base.py57) to convert between HuggingFace and Archon internal formats. For example, the Qwen3_5StateDictAdapter handles:

MoE Expert Mapping: Converts HF's list of 2D weights into Archon's 3D combined tensors areal/experimental/models/archon/qwen3_5/model/state_dict_adapter.py29-30
Composite Namespaces: Supports remapping text weights into composite namespaces for VLM checkpoints areal/experimental/models/archon/qwen3_5/model/state_dict_adapter.py38-39
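
The MoE expert mapping above is essentially a group-and-stack over checkpoint keys. The sketch below shows the idea with nested lists standing in for tensors; with real tensors the final step would be a stack along a new leading expert dimension. The key pattern, the output key names, and the function stack_expert_weights are hypothetical and are not the Qwen3_5StateDictAdapter implementation.

```python
import re
from collections import defaultdict
from typing import Any, Dict

# Hypothetical per-expert key layout: "<prefix>.experts.<idx>.<proj>.weight"
_EXPERT_KEY = re.compile(
    r"^(?P<prefix>.+\.experts)\.(?P<idx>\d+)\.(?P<proj>\w+)\.weight$"
)


def stack_expert_weights(hf_state: Dict[str, Any]) -> Dict[str, Any]:
    """Combine per-expert 2D weights into one 3D entry per projection."""
    out: Dict[str, Any] = {}
    grouped: Dict[tuple, Dict[int, Any]] = defaultdict(dict)
    for key, value in hf_state.items():
        match = _EXPERT_KEY.match(key)
        if match is None:
            out[key] = value  # non-expert weights pass through unchanged
            continue
        group = (match["prefix"], match["proj"])
        grouped[group][int(match["idx"])] = value
    for (prefix, proj), by_idx in grouped.items():
        count = len(by_idx)
        if sorted(by_idx) != list(range(count)):
            raise ValueError(f"missing expert indices under {prefix}.*.{proj}")
        # Order by expert index so dimension 0 of the result is the expert id.
        out[f"{prefix}.{proj}"] = [by_idx[i] for i in range(count)]
    return out


if __name__ == "__main__":
    hf = {
        "layers.0.mlp.experts.1.gate_proj.weight": [[5, 6], [7, 8]],
        "layers.0.mlp.experts.0.gate_proj.weight": [[1, 2], [3, 4]],
        "layers.0.attn.q_proj.weight": [[0]],
    }
    print(stack_expert_weights(hf))
```

Sorting by expert index before stacking matters because checkpoint keys are not guaranteed to arrive in order.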

Sources: areal/experimental/engine/archon_checkpoint.py86-166 areal/experimental/models/archon/base.py57-142 areal/experimental/models/archon/qwen3_5/model/state_dict_adapter.py21-136

URL: https://deepwiki.com/inclusionAI/AReaL/3.4-archonengine
