Archon Parallelism

This page documents Archon Engine's parallelism implementation, including ArchonParallelDims configuration, device mesh construction, pipeline schedules, and DTensor integration. Archon uses PyTorch-native distributed APIs, integrating DTensor, fully_shard (FSDP2), torch.distributed.pipelining, and advanced activation checkpointing.

ArchonParallelDims Configuration

The ArchonParallelDims class is the central configuration object that defines all parallelism dimensions and manages device mesh creation for Archon Engine. It is inspired by PyTorch's torchtitan but customized for AReaL's requirements.

Parallelism Dimensions

Dimension	Description	Typical Use Case / Notes	Source
`dp_shard`	FSDP shard dimension (data parallel)	Sharding model parameters across GPUs; auto-computed if -1	areal/experimental/models/archon/parallel_dims.py:112
`tp`	Tensor Parallel size	Sharding large layers (attention, FFN) across GPUs	areal/experimental/models/archon/parallel_dims.py:114
`cp`	Context Parallel size (Ulysses SP)	Distributing long sequences via all-to-all communication	areal/experimental/models/archon/parallel_dims.py:113
`pp`	Pipeline Parallel size	Splitting model layers across pipeline stages	areal/experimental/models/archon/parallel_dims.py:115
`ep`	Expert Parallel size	Distributing MoE experts across GPUs	areal/experimental/models/archon/parallel_dims.py:116
`etp`	Expert Tensor Parallel size	Must be 1 or equal to `tp`	areal/experimental/models/archon/parallel_dims.py:117

Constraint: dp_shard × tp × cp × pp = world_size (areal/experimental/models/archon/parallel_dims.py:129-135).

When dp_shard = -1, it is auto-computed as world_size // (tp × cp × pp) (areal/experimental/models/archon/parallel_dims.py:126-127).
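
The constraint and the auto-computation can be illustrated with a minimal sketch. ParallelDimsSketch is a hypothetical stand-in, not the actual ArchonParallelDims class: it models only the four dimensions that appear in the constraint and assumes validation happens at construction time.

```python
from dataclasses import dataclass


@dataclass
class ParallelDimsSketch:
    """Hypothetical stand-in for ArchonParallelDims (illustration only)."""

    dp_shard: int
    tp: int
    cp: int
    pp: int
    world_size: int

    def __post_init__(self) -> None:
        for name in ("tp", "cp", "pp"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be >= 1")
        non_dp = self.tp * self.cp * self.pp
        if self.dp_shard == -1:
            # Auto-compute the FSDP shard size from the ranks left over.
            self.dp_shard = self.world_size // non_dp
        # Enforce dp_shard * tp * cp * pp == world_size.
        if self.dp_shard < 1 or self.dp_shard * non_dp != self.world_size:
            raise ValueError(
                f"dp_shard({self.dp_shard}) * tp({self.tp}) * cp({self.cp}) "
                f"* pp({self.pp}) != world_size({self.world_size})"
            )


dims = ParallelDimsSketch(dp_shard=-1, tp=2, cp=1, pp=2, world_size=16)
print(dims.dp_shard)
```

With 16 ranks and tp=2, cp=1, pp=2, the remaining factor is 16 // 4 = 4, so dp_shard resolves to 4. A world size that is not divisible by tp × cp × pp fails the final check instead of silently leaving ranks unused.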

Sources: areal/experimental/models/archon/parallel_dims.py:26-167

Key Properties

The fsdp_gradient_divide_factor ensures consistent gradient scaling for FSDP-sharded experts when Expert Parallelism is enabled (areal/experimental/models/archon/parallel_dims.py:361-372).

Sources: areal/experimental/models/archon/parallel_dims.py:361-378

Parallelization Ordering

Archon Engine applies parallelization strategies in a strict order to avoid conflicts between distributed primitives. This ordering is defined in model-specific parallelization functions like parallelize_qwen3 and parallelize_qwen2.

Parallelization Pipeline

Diagram: Parallelization Application Order in Archon Engine

Ordering Rationale:

TP First: Establishes sequence parallelism and DTensor sharding patterns for attention and dense layers (areal/experimental/models/archon/qwen3/infra/parallelize.py:134-137, areal/experimental/models/archon/qwen2/infra/parallelize.py:116-118).
EP/ETP: MoE layers are parallelized after TP to handle Expert Tensor Parallelism (etp), which may borrow from the TP dimension (areal/experimental/models/archon/qwen3/infra/parallelize.py:141-150).
CP (Ulysses): Context parallelism via all-to-all communication is applied after TP to ensure head distribution is compatible with sequence splitting (areal/experimental/models/archon/qwen3/infra/parallelize.py:153-155, areal/experimental/models/archon/qwen2/infra/parallelize.py:121-123).
Activation Checkpointing: Applied after distributed primitives so that the recomputation logic can correctly handle DTensor operations (areal/experimental/models/archon/qwen3/infra/parallelize.py:158-164, areal/experimental/models/archon/qwen2/infra/parallelize.py:126-132).
torch.compile: Applied before FSDP to allow the compiler to optimize the internal graph of transformer blocks (areal/experimental/models/archon/qwen3/infra/parallelize.py:167-169, areal/experimental/models/archon/qwen2/infra/parallelize.py:135-136).
FSDP Last: FSDP wraps the already-parallelized sub-modules for parameter sharding and gradient reduction (areal/experimental/models/archon/qwen3/infra/parallelize.py:172-193, areal/experimental/models/archon/qwen2/infra/parallelize.py:139-149).
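
The ordering above can be sketched as a fixed sequence that filters the enabled steps without ever reordering them. The step names and the plan_parallelization helper are hypothetical and only illustrate the ordering; the real logic lives in functions such as parallelize_qwen3.

```python
# Step names follow the ordering rationale above; they are illustrative labels,
# not identifiers from the Archon code base.
APPLY_ORDER = ("tp", "ep", "cp", "ac", "compile", "fsdp")


def plan_parallelization(enabled: set) -> list:
    """Return the enabled steps in the order they must be applied."""
    unknown = enabled - set(APPLY_ORDER)
    if unknown:
        raise ValueError(f"unknown steps: {sorted(unknown)}")
    # Iterating over APPLY_ORDER (not over `enabled`) guarantees that, for
    # example, FSDP always wraps modules that TP has already transformed.
    return [step for step in APPLY_ORDER if step in enabled]


print(plan_parallelization({"fsdp", "tp", "ac"}))
```

Because the plan is derived from the fixed tuple rather than from the caller's input, the caller cannot accidentally request FSDP before TP.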

Sources: areal/experimental/models/archon/qwen3/infra/parallelize.py:86-193, areal/experimental/models/archon/qwen2/infra/parallelize.py:71-156

Pipeline Parallelism and Runners

Archon implements pipeline parallelism (PP) using torch.distributed.pipelining.

Stage Management

The function generate_llm_fqn_per_model_part distributes transformer layers across pipeline stages (areal/experimental/models/archon/pipeline_parallel.py:85-197). It assigns weights to the input modules (tok_embeddings) and output modules (norm, output/score) so that the stages holding them receive correspondingly fewer transformer layers, which balances computational load (areal/experimental/models/archon/pipeline_parallel.py:97-103).
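
A weighted split of this kind can be sketched as follows. The split_layers function, its parameter names, and its rounding rule are assumptions for illustration and may differ from generate_llm_fqn_per_model_part; the idea shown is only that the embedding and output modules count as a number of layer-equivalents.

```python
def split_layers(num_layers, num_stages, input_weight=1, output_weight=1):
    """Assign module names to stages, counting the input and output modules
    as `input_weight` and `output_weight` layer-equivalents (hypothetical)."""
    if num_stages < 1:
        raise ValueError("num_stages must be >= 1")
    total = num_layers + input_weight + output_weight
    base, extra = divmod(total, num_stages)
    if base < max(input_weight, output_weight):
        raise ValueError("too many stages for this model")

    stages, next_layer = [], 0
    for stage in range(num_stages):
        # Earlier stages absorb the remainder, one extra unit each.
        budget = base + (1 if stage < extra else 0)
        names = []
        if stage == 0:
            names.append("tok_embeddings")
            budget -= input_weight
        if stage == num_stages - 1:
            budget -= output_weight
        for _ in range(budget):
            names.append(f"layers.{next_layer}")
            next_layer += 1
        if stage == num_stages - 1:
            names += ["norm", "output"]
        stages.append(names)
    return stages


for stage_names in split_layers(num_layers=6, num_stages=2):
    print(stage_names)
```

With 6 layers, 2 stages, and unit weights, the first stage holds tok_embeddings plus layers 0-2 and the last stage holds layers 3-5 plus norm and output, so each stage carries four units of work.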

Pipeline Runners and Schedules

Archon supports various pipeline schedules via build_pipeline_schedule (areal/experimental/models/archon/pipeline_parallel.py:48-82). Execution is managed by PipelinedRunner (areal/experimental/engine/archon_runner.py:124-142).

Key schedules include:

Single-stage schedules: e.g., "1F1B", built on PipelineScheduleSingle (areal/experimental/models/archon/pipeline_parallel.py:18).
Zero-bubble schedules: ScheduleZBVZeroBubble and ScheduleDualPipeV, which minimize pipeline bubbles (areal/experimental/models/archon/pipeline_parallel.py:19-20).

The PipelinedRunner includes a memory optimization that patches schedule._merge_outputs to skip the torch.cat operation over microbatch outputs. This avoids allocating an extra concatenated copy of all outputs and significantly reduces peak memory usage (areal/experimental/engine/archon_runner.py:162-175).
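
The patching technique can be illustrated with a stand-in object. FakeSchedule and skip_output_merge are hypothetical names, and list concatenation stands in for torch.cat; the sketch shows only how overriding _merge_outputs on one instance makes the schedule hand back per-microbatch outputs unmerged.

```python
class FakeSchedule:
    """Stand-in for a pipeline schedule (not a torch class)."""

    def _merge_outputs(self, output_chunks):
        # Stand-in for the concatenation step: builds one merged copy.
        merged = []
        for chunk in output_chunks:
            merged.extend(chunk)
        return merged

    def step(self, microbatches):
        outputs = [[x * 2 for x in mb] for mb in microbatches]
        return self._merge_outputs(outputs)


def skip_output_merge(schedule):
    """Override _merge_outputs on this instance so chunks pass through as-is."""
    schedule._merge_outputs = lambda output_chunks: output_chunks
    return schedule


schedule = FakeSchedule()
print(schedule.step([[1, 2], [3]]))   # merged into one flat list
skip_output_merge(schedule)
print(schedule.step([[1, 2], [3]]))   # one list per microbatch, no merge
```

Patching the instance rather than the class keeps the change local to the schedule owned by the runner. The trade-off is that any consumer of the result must accept a list of per-microbatch outputs instead of a single merged value.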

Model Specifications

The ModelSpec dataclass registers model-specific parallelization and pipelining logic (areal/experimental/models/archon/model_spec.py:86-96). Based on the HuggingFace model_type (areal/experimental/models/archon/model_spec.py:113-118), the engine retrieves the matching ParallelizeFn (areal/experimental/models/archon/model_spec.py:26-52) and PipeliningFn (areal/experimental/models/archon/model_spec.py:55-82).
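
A registry keyed by model_type might look like the sketch below. ModelSpecSketch, its field names, register_spec, and get_spec are all hypothetical; the actual ModelSpec fields and lookup helpers are not shown on this page.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional


@dataclass(frozen=True)
class ModelSpecSketch:
    """Hypothetical stand-in for ModelSpec; field names are assumptions."""

    name: str
    parallelize_fn: Callable[..., object]
    pipelining_fn: Optional[Callable[..., object]] = None


_SPECS: Dict[str, ModelSpecSketch] = {}


def register_spec(model_types: Iterable[str], spec: ModelSpecSketch) -> None:
    """Register one spec under every model_type string it supports."""
    for model_type in model_types:
        if model_type in _SPECS:
            raise ValueError(f"model_type {model_type!r} is already registered")
        _SPECS[model_type] = spec


def get_spec(model_type: str) -> ModelSpecSketch:
    """Look up the spec for a model_type, failing loudly if it is unknown."""
    try:
        return _SPECS[model_type]
    except KeyError:
        raise KeyError(f"no spec registered for model_type={model_type!r}") from None


# Placeholder function: the real ParallelizeFn takes a model and parallel dims.
register_spec(["qwen3"], ModelSpecSketch("qwen3", parallelize_fn=lambda model: model))
print(get_spec("qwen3").name)
```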

Sources: areal/experimental/models/archon/pipeline_parallel.py:9-197, areal/experimental/models/archon/model_spec.py:5-138, areal/experimental/engine/archon_runner.py:124-175

MoE Expert Parallelism

Archon supports Expert Parallelism (EP) and Expert Tensor Parallelism (ETP) strategies within the MoE module (areal/experimental/models/archon/moe/moe.py:40-61).

Strategy Selection Matrix

EP	TP	etp	Strategy	Expert Weight Sharding
1	1	-	None	`Replicate`
1	>1	-	`TensorParallel`	`[Shard(1/2)]`
>1	1	-	`ExpertParallel`	`[Shard(0)]`
>1	>1	1	`ExpertParallel`	`[Shard(0)]` (TP borrowed by EP)
>1	>1	tp	`ExpertTensorParallel`	`[Shard(0), Shard(1/2)]`
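
The matrix above can be read as a small decision function. select_moe_strategy is a hypothetical helper written for illustration; the placement strings are copied verbatim from the table, where Shard(1/2) denotes the table's own shorthand for the non-expert sharding dimension.

```python
from typing import List, Optional, Tuple


def select_moe_strategy(ep: int, tp: int, etp: int = 1) -> Tuple[Optional[str], List[str]]:
    """Map (ep, tp, etp) to a strategy name and expert weight placements."""
    if ep < 1 or tp < 1:
        raise ValueError("ep and tp must be >= 1")
    if etp not in (1, tp):
        raise ValueError("etp must be 1 or equal to tp")

    if ep == 1:
        if tp == 1:
            return None, ["Replicate"]
        return "TensorParallel", ["Shard(1/2)"]
    if tp == 1 or etp == 1:
        # With tp > 1 and etp == 1, the TP dimension is borrowed by EP.
        return "ExpertParallel", ["Shard(0)"]
    return "ExpertTensorParallel", ["Shard(0)", "Shard(1/2)"]


for args in [(1, 1, 1), (1, 4, 1), (8, 1, 1), (8, 4, 1), (8, 4, 4)]:
    print(args, select_moe_strategy(*args))
```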

Sources: areal/experimental/models/archon/parallel_dims.py:49-57

Implementation Logic

The MoE layer uses a TokenChoiceTopKRouter to assign tokens to experts (areal/experimental/models/archon/moe/moe.py:74-85). If EP is enabled, token dispatch and combination happen via hooks registered by ExpertParallel (areal/experimental/models/archon/moe/moe.py:173-175).

When etp=1 and tp>1, the TP dimension is "borrowed" by EP for token dispatch, meaning experts are sharded only along the expert dimension (areal/experimental/models/archon/parallel_dims.py:60-62). When etp=tp, experts use 2D sharding that combines both dimensions (areal/experimental/models/archon/parallel_dims.py:63-65).

Sources: areal/experimental/models/archon/parallel_dims.py:49-109, areal/experimental/models/archon/moe/moe.py:40-175

Activation Checkpointing (AC)

Archon provides flexible AC configurations via ActivationCheckpointConfig (areal/experimental/models/archon/activation_checkpoint.py:37-68).

AC Modes

Selective (Op-level): Uses create_selective_checkpoint_contexts via _apply_op_sac to save the outputs of specific operations (such as torch.ops.aten.mm.default or attention ops) while recomputing others (areal/experimental/models/archon/activation_checkpoint.py:115-197). It uses a custom policy that can force recomputation for specific FQNs such as moe.router.gate (areal/experimental/models/archon/activation_checkpoint.py:46-155).
Selective (Layer-level): Checkpoints every Nth layer using _apply_layer_sac (areal/experimental/models/archon/activation_checkpoint.py:87-113).
Full: Checkpoints the entire module via _apply_full_ac (areal/experimental/models/archon/activation_checkpoint.py:199-210).
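
The layer-level mode can be illustrated by computing which layers would be wrapped. layers_to_checkpoint is a hypothetical helper, and the counting convention (layers counted from 1, with a layer wrapped when its count is divisible by N) is an assumption; _apply_layer_sac may count differently.

```python
def layers_to_checkpoint(num_layers: int, every_n: int) -> list:
    """Return 0-based indices of the layers that would be checkpointed."""
    if every_n < 1:
        raise ValueError("every_n must be >= 1")
    selected = []
    for index in range(num_layers):
        count = index + 1  # assumed convention: layers are counted from 1
        if count % every_n == 0:
            selected.append(index)
    return selected


print(layers_to_checkpoint(num_layers=6, every_n=2))
print(layers_to_checkpoint(num_layers=6, every_n=1))
```

Under this convention every_n=1 selects every layer, which matches full checkpointing at layer granularity, while larger values trade less recomputation for more stored activations.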

Sources: areal/experimental/models/archon/activation_checkpoint.py:37-210

Context Parallelism (Ulysses SP)

Archon implements Ulysses Sequence Parallelism (SP) for context parallelism. This approach uses All-to-All communication to redistribute data between the sequence dimension and the attention head dimension.

Ulysses Data Flow

Diagram: Ulysses Sequence Parallelism in Archon Attention

Implementation Detail

In the Attention module, the set_cp_group method configures the process group used for communication (areal/experimental/models/archon/qwen3/model/model.py:161-174, areal/experimental/models/archon/qwen2/model/model.py:99-112). During the forward pass, gather_seq_scatter_heads is called to redistribute xq, xk, and xv before the attention operation (areal/experimental/models/archon/qwen2/model/model.py:152-172). After attention, gather_heads_scatter_seq restores the original sequence sharding (areal/experimental/models/archon/qwen2/model/model.py:194-196).

Sources: areal/experimental/models/archon/qwen3/model/model.py:103-210, areal/experimental/models/archon/qwen2/model/model.py:51-201
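
The data movement of the two all-to-all steps can be simulated on a single process with nested lists. The functions below reuse the names gather_seq_scatter_heads and gather_heads_scatter_seq for readability, but they are simplified stand-ins that take all ranks' shards at once; the real functions operate on tensors and a process group.

```python
def gather_seq_scatter_heads(shards):
    """shards[r] is rank r's [S/P][H] block; returns per-rank [S][H/P] blocks."""
    world = len(shards)
    num_heads = len(shards[0][0])
    if num_heads % world:
        raise ValueError("number of heads must be divisible by the CP size")
    per_rank = num_heads // world
    out = []
    for rank in range(world):
        rows = []
        # Visiting source ranks in order restores the full sequence order.
        for src in range(world):
            for row in shards[src]:
                rows.append(row[rank * per_rank:(rank + 1) * per_rank])
        out.append(rows)
    return out


def gather_heads_scatter_seq(shards):
    """Inverse step: shards[r] is [S][H/P]; returns per-rank [S/P][H] blocks."""
    world = len(shards)
    seq_len = len(shards[0])
    if seq_len % world:
        raise ValueError("sequence length must be divisible by the CP size")
    per_rank = seq_len // world
    out = []
    for rank in range(world):
        rows = []
        for s in range(rank * per_rank, (rank + 1) * per_rank):
            row = []
            for src in range(world):
                row.extend(shards[src][s])  # reassemble all heads for position s
            rows.append(row)
        out.append(rows)
    return out


# 4 sequence positions, 4 heads, CP size 2; each element is (position, head).
full = [[(s, h) for h in range(4)] for s in range(4)]
seq_sharded = [full[0:2], full[2:4]]
head_sharded = gather_seq_scatter_heads(seq_sharded)
print(head_sharded[0][3])  # rank 0 now sees position 3, but only heads 0-1
print(gather_heads_scatter_seq(head_sharded) == seq_sharded)
```

After the first step each rank holds every sequence position for a subset of heads, which is what attention needs; the second step returns to the sequence-sharded layout. This is also why the number of heads must be divisible by the CP size.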

Weight Synchronization

Archon manages weight synchronization between training and inference engines using NCCL/XCCL process groups.

Weight Update Flow

Diagram: Weight Synchronization via XCCL in Archon

The WeightSyncState manages the process group initialized via init_weight_update_group (areal/experimental/engine/archon_weight_sync.py:30-92). Weights are collected into buckets and broadcast from the training rank to the inference engines (areal/experimental/engine/archon_weight_sync.py:114-176). The function _get_full_tensor converts DTensor and CPU-offloaded parameters to local device tensors before broadcasting (areal/experimental/engine/archon_weight_sync.py:95-111).

Sources: areal/experimental/engine/archon_weight_sync.py:30-212
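
Bucketing can be sketched as a greedy grouping by size. bucket_params and the byte-limit policy are assumptions made for illustration; the page does not state how Archon chooses bucket boundaries, and in the real flow each bucket would be materialized and broadcast before the next one is built.

```python
def bucket_params(params, bucket_bytes):
    """Greedily group (name, nbytes) pairs so each bucket stays within
    `bucket_bytes`, except that an oversized parameter gets its own bucket."""
    buckets, current, current_size = [], [], 0
    for name, nbytes in params:
        if current and current_size + nbytes > bucket_bytes:
            buckets.append(current)  # flush before the limit is exceeded
            current, current_size = [], 0
        current.append(name)
        current_size += nbytes
    if current:
        buckets.append(current)
    return buckets


params = [("a", 4), ("b", 4), ("c", 6), ("d", 20), ("e", 1)]
print(bucket_params(params, bucket_bytes=10))
```

Bounding the bucket size limits how many full tensors exist on the device at once, which matters when the parameters are normally sharded or offloaded.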

URL: https://deepwiki.com/inclusionAI/AReaL/8.4-archon-parallelism