
URL: https://deepwiki.com/inclusionAI/AReaL/8.4-archon-parallelism



Last indexed: 7 May 2026 (2e12c1)

Archon Parallelism

This page documents Archon Engine's parallelism implementation, including ArchonParallelDims configuration, device mesh construction, pipeline schedules, and DTensor integration. Archon uses PyTorch-native distributed APIs, integrating DTensor, fully_shard (FSDP2), torch.distributed.pipelining, and advanced activation checkpointing.


ArchonParallelDims Configuration

The ArchonParallelDims class is the central configuration object that defines all parallelism dimensions and manages device mesh creation for Archon Engine. It is inspired by PyTorch's torchtitan but customized for AReaL's requirements.

Parallelism Dimensions

| Dimension | Description | Typical Use Case | Source |
|---|---|---|---|
| dp_shard | FSDP shard dimension (data parallel) | Sharding model parameters across GPUs; auto-computed if -1 | areal/experimental/models/archon/parallel_dims.py:112 |
| tp | Tensor Parallel size | Sharding large layers (attention, FFN) across GPUs | areal/experimental/models/archon/parallel_dims.py:114 |
| cp | Context Parallel size (Ulysses SP) | Distributing long sequences via all-to-all communication | areal/experimental/models/archon/parallel_dims.py:113 |
| pp | Pipeline Parallel size | Splitting model layers across pipeline stages | areal/experimental/models/archon/parallel_dims.py:115 |
| ep | Expert Parallel size | Distributing MoE experts across GPUs | areal/experimental/models/archon/parallel_dims.py:116 |
| etp | Expert Tensor Parallel size | Must be 1 or equal to tp | areal/experimental/models/archon/parallel_dims.py:117 |

Constraint: dp_shard × tp × cp × pp = world_size (areal/experimental/models/archon/parallel_dims.py:129-135).

When dp_shard = -1, it is auto-computed as world_size // (tp × cp × pp) (areal/experimental/models/archon/parallel_dims.py:126-127).

Sources: areal/experimental/models/archon/parallel_dims.py:26-167
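
The constraint and the dp_shard = -1 rule above can be illustrated with a minimal sketch. The function name resolve_dp_shard and its signature are hypothetical and are not taken from parallel_dims.py; only the arithmetic follows the text.

```python
def resolve_dp_shard(world_size: int, dp_shard: int = -1,
                     tp: int = 1, cp: int = 1, pp: int = 1) -> int:
    """Return the effective dp_shard (hypothetical helper, not Archon's API)."""
    non_dp = tp * cp * pp
    if dp_shard == -1:
        # Auto-compute: whatever is left of the world after tp, cp and pp.
        dp_shard = world_size // non_dp
    # The product of all four dimensions must cover the world exactly.
    if dp_shard * non_dp != world_size:
        raise ValueError(
            f"dp_shard({dp_shard}) * tp({tp}) * cp({cp}) * pp({pp}) "
            f"!= world_size({world_size})"
        )
    return dp_shard
```

For example, with 16 ranks and tp = cp = pp = 2, the inferred dp_shard is 2. A world size that is not divisible by tp × cp × pp fails the same check, because the floor division leaves ranks unaccounted for.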

Key Properties


The fsdp_gradient_divide_factor ensures consistent gradient scaling for FSDP-sharded experts when Expert Parallelism is enabled (areal/experimental/models/archon/parallel_dims.py:361-372).

Sources: areal/experimental/models/archon/parallel_dims.py:361-378


Parallelization Ordering

Archon Engine applies parallelization strategies in a strict order to avoid conflicts between distributed primitives. This ordering is defined in model-specific parallelization functions like parallelize_qwen3 and parallelize_qwen2.

Parallelization Pipeline


Diagram: Parallelization Application Order in Archon Engine

Ordering Rationale:

  1. TP First: Establishes sequence parallelism and DTensor sharding patterns for attention and dense layers (areal/experimental/models/archon/qwen3/infra/parallelize.py:134-137, areal/experimental/models/archon/qwen2/infra/parallelize.py:116-118).
  2. EP/ETP: MoE layers are parallelized after TP to handle Expert Tensor Parallelism (etp), which may borrow from the TP dimension (areal/experimental/models/archon/qwen3/infra/parallelize.py:141-150).
  3. CP (Ulysses): Context parallelism via all-to-all communication is applied after TP to ensure head distribution is compatible with sequence splitting (areal/experimental/models/archon/qwen3/infra/parallelize.py:153-155, areal/experimental/models/archon/qwen2/infra/parallelize.py:121-123).
  4. Activation Checkpointing: Applied after distributed primitives so that the recomputation logic can correctly handle DTensor operations (areal/experimental/models/archon/qwen3/infra/parallelize.py:158-164, areal/experimental/models/archon/qwen2/infra/parallelize.py:126-132).
  5. torch.compile: Applied before FSDP to allow the compiler to optimize the internal graph of transformer blocks (areal/experimental/models/archon/qwen3/infra/parallelize.py:167-169, areal/experimental/models/archon/qwen2/infra/parallelize.py:135-136).
  6. FSDP Last: FSDP wraps the already-parallelized sub-modules for parameter sharding and gradient reduction (areal/experimental/models/archon/qwen3/infra/parallelize.py:172-193, areal/experimental/models/archon/qwen2/infra/parallelize.py:139-149).

Sources: areal/experimental/models/archon/qwen3/infra/parallelize.py:86-193, areal/experimental/models/archon/qwen2/infra/parallelize.py:71-156
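
The ordering can be summarized as a small planning function. This is only a sketch: ParallelPlan, its fields, and the step labels are hypothetical, and the real parallelize_qwen3 / parallelize_qwen2 functions operate on the model and device mesh rather than returning strings. It also assumes each step is simply skipped when its feature is disabled.

```python
from dataclasses import dataclass


@dataclass
class ParallelPlan:
    """Hypothetical switches; not Archon's actual configuration object."""
    tp: int = 1
    ep: int = 1
    cp: int = 1
    activation_checkpoint: bool = False
    compile: bool = False
    fsdp: bool = True


def parallelization_steps(plan: ParallelPlan) -> list[str]:
    """List the enabled steps in the order described above."""
    steps = []
    if plan.tp > 1:
        steps.append("tensor_parallel")        # 1. TP first
    if plan.ep > 1:
        steps.append("expert_parallel")        # 2. EP/ETP after TP
    if plan.cp > 1:
        steps.append("context_parallel")       # 3. Ulysses CP after TP
    if plan.activation_checkpoint:
        steps.append("activation_checkpoint")  # 4. after distributed primitives
    if plan.compile:
        steps.append("torch_compile")          # 5. before FSDP
    if plan.fsdp:
        steps.append("fsdp")                   # 6. FSDP last
    return steps
```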


Pipeline Parallelism and Runners

Archon implements pipeline parallelism (PP) using torch.distributed.pipelining.

Stage Management

The function generate_llm_fqn_per_model_part distributes transformer layers across pipeline stages (areal/experimental/models/archon/pipeline_parallel.py:85-197). It assigns weights to input modules (tok_embeddings) and output modules (norm, output/score) to balance computational load (areal/experimental/models/archon/pipeline_parallel.py:97-103).
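
One way to picture weighted layer distribution is the sketch below. It is not the body of generate_llm_fqn_per_model_part: the function name split_layers, the default weights, and the rule that a weight counts as that many "layer equivalents" are all assumptions made for illustration.

```python
def split_layers(num_layers: int, num_stages: int,
                 input_weight: int = 1, output_weight: int = 1) -> list[list[str]]:
    """Assign module names to pipeline stages (illustrative sketch only).

    Assumption: the input module costs `input_weight` transformer layers and
    the output modules cost `output_weight`, so the first and last stages
    receive correspondingly fewer transformer layers.
    """
    effective = num_layers + input_weight + output_weight
    base, extra = divmod(effective, num_stages)
    stages, next_layer = [], 0
    for stage in range(num_stages):
        budget = base + (1 if stage < extra else 0)
        names = []
        if stage == 0:
            names.append("tok_embeddings")
            budget -= input_weight
        if stage == num_stages - 1:
            budget -= output_weight
        if budget < 0:
            raise ValueError("too many stages for the given layer count and weights")
        for _ in range(budget):
            names.append(f"layers.{next_layer}")
            next_layer += 1
        if stage == num_stages - 1:
            names += ["norm", "output"]
        stages.append(names)
    return stages
```

With 8 layers and 4 stages, this sketch places 2 transformer layers next to tok_embeddings, 3 and 2 layers on the middle stages, and 1 layer next to norm and output.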

Pipeline Runners and Schedules

Archon supports various pipeline schedules via build_pipeline_schedule (areal/experimental/models/archon/pipeline_parallel.py:48-82). Execution is managed by PipelinedRunner (areal/experimental/engine/archon_runner.py:124-142).

Key schedules include:

The PipelinedRunner includes a memory optimization that patches schedule._merge_outputs to skip the torch.cat operation over microbatch outputs, which reduces peak memory usage (areal/experimental/engine/archon_runner.py:162-175).

Model Specifications

The ModelSpec dataclass registers model-specific parallelization and pipelining logic (areal/experimental/models/archon/model_spec.py:86-96). Based on the HuggingFace model_type (areal/experimental/models/archon/model_spec.py:113-118), the engine retrieves the matching ParallelizeFn (areal/experimental/models/archon/model_spec.py:26-52) and PipeliningFn (areal/experimental/models/archon/model_spec.py:55-82).
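
A registry keyed by model_type could look roughly like the following. The field names, the registry dictionary, and the helper functions are hypothetical stand-ins; the actual ModelSpec in model_spec.py may carry different fields and use a different registration mechanism.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical shape of a spec entry, for illustration only."""
    name: str
    parallelize_fn: Callable[..., object]
    pipelining_fn: Callable[..., object]


_SPECS: Dict[str, ModelSpec] = {}


def register_model_spec(model_type: str, spec: ModelSpec) -> None:
    """Register a spec under a HuggingFace-style model_type string."""
    if model_type in _SPECS:
        raise ValueError(f"model_type {model_type!r} is already registered")
    _SPECS[model_type] = spec


def get_model_spec(model_type: str) -> ModelSpec:
    """Look up the spec for a model_type, failing loudly if it is unknown."""
    try:
        return _SPECS[model_type]
    except KeyError:
        raise KeyError(f"no ModelSpec registered for {model_type!r}") from None
```

Keeping the lookup keyed by model_type means the engine does not need model-specific branches; supporting a new architecture amounts to registering one more entry.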

Sources: areal/experimental/models/archon/pipeline_parallel.py:9-197, areal/experimental/models/archon/model_spec.py:5-138, areal/experimental/engine/archon_runner.py:124-175


MoE Expert Parallelism

Archon supports Expert Parallelism (EP) and Expert Tensor Parallelism (ETP) strategies within the MoE module (areal/experimental/models/archon/moe/moe.py:40-61).

Strategy Selection Matrix

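| EP | TP | etp | Strategy | Expert Weight Sharding |
|---|---|---|---|---|
| 1 | 1 | - | None | Replicate |
| 1 | >1 | - | TensorParallel | [Shard(1/2)] |
| >1 | 1 | - | ExpertParallel | [Shard(0)] |
| >1 | >1 | 1 | ExpertParallel | [Shard(0)] (TP borrowed by EP) |
| >1 | >1 | tp | ExpertTensorParallel | [Shard(0), Shard(1/2)] |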

Sources: areal/experimental/models/archon/parallel_dims.py:49-57
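
The matrix translates directly into a lookup function. The sketch below mirrors only the rows of the table; select_moe_strategy is a hypothetical name and the returned strings are labels taken from the Strategy column, not class objects from the codebase.

```python
from typing import Optional


def select_moe_strategy(ep: int, tp: int, etp: int = 1) -> Optional[str]:
    """Return the strategy label from the matrix above (illustrative only)."""
    if etp not in (1, tp):
        raise ValueError("etp must be 1 or equal to tp")
    if ep == 1:
        # No expert parallelism: experts are replicated or TP-sharded.
        return None if tp == 1 else "TensorParallel"
    if tp == 1 or etp == 1:
        # Experts sharded only along the expert dimension; with tp > 1 and
        # etp == 1 the TP dimension is borrowed by EP.
        return "ExpertParallel"
    # ep > 1, tp > 1, etp == tp: 2D sharding across both dimensions.
    return "ExpertTensorParallel"
```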

Implementation Logic

The MoE layer uses a TokenChoiceTopKRouter to assign tokens to experts (areal/experimental/models/archon/moe/moe.py:74-85). If EP is enabled, dispatch and combination happen via hooks registered by ExpertParallel (areal/experimental/models/archon/moe/moe.py:173-175).

When etp=1 and tp>1, the TP dimension is "borrowed" by EP for token dispatch, meaning experts are only sharded by the expert dimension (areal/experimental/models/archon/parallel_dims.py:60-62). When etp=tp, experts use 2D sharding combining both dimensions (areal/experimental/models/archon/parallel_dims.py:63-65).

Sources: areal/experimental/models/archon/parallel_dims.py:49-109, areal/experimental/models/archon/moe/moe.py:40-175


Activation Checkpointing (AC)

Archon provides flexible AC configurations via ActivationCheckpointConfig (areal/experimental/models/archon/activation_checkpoint.py:37-68).

AC Modes

Sources: areal/experimental/models/archon/activation_checkpoint.py:37-210


Context Parallelism (Ulysses SP)

Archon implements Ulysses Sequence Parallelism (SP) for context parallelism. This approach uses All-to-All communication to redistribute data between the sequence dimension and the attention head dimension.

Ulysses Data Flow


Diagram: Ulysses Sequence Parallelism in Archon Attention

Implementation Detail

In the Attention module, the set_cp_group method configures the process group for communication (areal/experimental/models/archon/qwen3/model/model.py:161-174, areal/experimental/models/archon/qwen2/model/model.py:99-112). During the forward pass, gather_seq_scatter_heads is called to redistribute xq, xk, and xv before the attention operation (areal/experimental/models/archon/qwen2/model/model.py:152-172). After attention, gather_heads_scatter_seq restores the original sequence sharding (areal/experimental/models/archon/qwen2/model/model.py:194-196).

Sources: areal/experimental/models/archon/qwen3/model/model.py:103-210, areal/experimental/models/archon/qwen2/model/model.py:51-201
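
The effect of the two all-to-all steps on data layout can be simulated in a single process with plain lists. The functions below are stand-ins written for this illustration (the simulate_ prefix marks them as such); they do not call torch.distributed and say nothing about how Archon's gather_seq_scatter_heads and gather_heads_scatter_seq are implemented. Each rank's shard is a list of tokens, and each token is a list with one entry per attention head.

```python
def simulate_gather_seq_scatter_heads(shards):
    """Sequence-sharded, all heads -> full sequence, head-sharded."""
    cp = len(shards)
    num_heads = len(shards[0][0])
    if num_heads % cp != 0:
        raise ValueError("number of heads must be divisible by the CP size")
    heads_per_rank = num_heads // cp
    out = []
    for dst in range(cp):
        lo, hi = dst * heads_per_rank, (dst + 1) * heads_per_rank
        # Rank dst receives its head slice from every rank, in sequence order.
        out.append([token[lo:hi] for src in range(cp) for token in shards[src]])
    return out


def simulate_gather_heads_scatter_seq(shards):
    """Inverse step: full sequence, head-sharded -> sequence-sharded, all heads."""
    cp = len(shards)
    seq_len = len(shards[0])
    if seq_len % cp != 0:
        raise ValueError("sequence length must be divisible by the CP size")
    seq_per_rank = seq_len // cp
    out = []
    for dst in range(cp):
        rows = []
        for s in range(dst * seq_per_rank, (dst + 1) * seq_per_rank):
            # Reassemble all heads of token s from every rank's head slice.
            rows.append([h for src in range(cp) for h in shards[src][s]])
        out.append(rows)
    return out
```

With a sequence of 4 tokens, 4 heads, and 2 ranks, each rank starts with 2 tokens × 4 heads, holds 4 tokens × 2 heads during attention, and returns to 2 tokens × 4 heads afterwards.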


Weight Synchronization

Archon manages weight synchronization between training and inference engines using XCCL (NCCL/XCCL) process groups.

Weight Update Flow


Diagram: Weight Synchronization via XCCL in Archon

The WeightSyncState manages the process group initialized via init_weight_update_group (areal/experimental/engine/archon_weight_sync.py:30-92). Weights are collected into buckets and broadcast from the training rank to inference engines (areal/experimental/engine/archon_weight_sync.py:114-176). The function _get_full_tensor handles the conversion of DTensor and CPU-offloaded parameters to local device tensors before broadcasting (areal/experimental/engine/archon_weight_sync.py:95-111).

Sources: areal/experimental/engine/archon_weight_sync.py:30-212
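
The page does not state how buckets are formed, so the following is only one plausible scheme: a greedy grouping by byte size. The function name bucket_by_size, the size limit, and the (name, nbytes) input format are assumptions for illustration and are not taken from archon_weight_sync.py.

```python
def bucket_by_size(params, max_bucket_bytes):
    """Greedily group (name, nbytes) pairs into buckets (illustrative sketch).

    A bucket is closed as soon as adding the next parameter would exceed
    max_bucket_bytes. A single parameter larger than the limit still gets
    its own bucket rather than being dropped.
    """
    buckets, current, current_bytes = [], [], 0
    for name, nbytes in params:
        if current and current_bytes + nbytes > max_bucket_bytes:
            buckets.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += nbytes
    if current:
        buckets.append(current)
    return buckets
```

Grouping parameters this way trades a few large broadcasts against many small ones, while the size limit bounds the temporary memory needed to hold fully materialized tensors at any one time.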