MoE and Expert Parallelism

This page describes MoE (Mixture of Experts) parallelism strategies in AReaL, including Expert Parallel (EP) and Expert Tensor Parallel (ETP). These techniques enable efficient training of large MoE models by distributing expert weights across GPUs.

For information about other parallelism strategies (TP, DP, PP, CP), see 8.1 Parallelism Overview. For the general Archon Engine architecture, see 3.4 ArchonEngine.

Overview

MoE parallelism distributes expert computation across GPUs through two primary mechanisms:

Expert Parallel (EP): Shards experts across GPUs along the expert dimension. Each GPU holds a subset of experts (e.g., experts 0-3 on GPU 0, experts 4-7 on GPU 1).
Expert Tensor Parallel (ETP): Further shards each expert's weights using tensor parallelism within the expert parallel group.
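
As an illustration of the EP placement described above, the following minimal sketch computes which global expert indices land on each EP rank. It assumes a contiguous, evenly divisible split; the helper name experts_for_rank is hypothetical and is not part of AReaL.

```python
def experts_for_rank(num_experts: int, ep_size: int, ep_rank: int) -> list[int]:
    """Return the global expert indices owned by one EP rank.

    Assumption: experts are split into contiguous, equally sized blocks.
    """
    if num_experts % ep_size != 0:
        raise ValueError("this sketch requires num_experts to be divisible by ep_size")
    per_rank = num_experts // ep_size
    start = ep_rank * per_rank
    return list(range(start, start + per_rank))


# 8 experts over 2 EP ranks: experts 0-3 on rank 0, experts 4-7 on rank 1.
for rank in range(2):
    print(rank, experts_for_rank(8, 2, rank))
```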

AReaL supports MoE models primarily through the Archon Engine, which provides a custom MoE implementation with EP/ETP configured via ArchonParallelDims areal/experimental/engine/archon_engine.py171-176. It also supports higher-precision routing via RouterGatingLinearFunction areal/experimental/models/archon/moe/router.py14-52 and flexible expert sharding.

MoE layers can coexist with dense FFN layers in the same model. In Archon, the decoder_sparse_step argument determines which layers are MoE areal/experimental/models/archon/qwen3/model/args.py23-25.
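
A minimal sketch of how a sparse-step setting can select MoE layers. It follows the rule used by the HuggingFace Qwen MoE models (a layer is MoE when (layer_idx + 1) is a multiple of decoder_sparse_step and the layer is not listed in mlp_only_layers); whether Archon applies exactly the same rule is an assumption, and is_moe_layer is a hypothetical helper.

```python
def is_moe_layer(
    layer_idx: int,
    num_experts: int,
    decoder_sparse_step: int,
    mlp_only_layers: tuple[int, ...] = (),
) -> bool:
    """Decide whether a decoder layer uses MoE or a dense FFN.

    Assumption: same rule as the HuggingFace Qwen MoE decoder layer.
    """
    if layer_idx in mlp_only_layers:
        return False
    return num_experts > 0 and (layer_idx + 1) % decoder_sparse_step == 0


# With decoder_sparse_step=2, every second layer (1, 3, 5, ...) is MoE.
print([i for i in range(6) if is_moe_layer(i, num_experts=8, decoder_sparse_step=2)])
```

With decoder_sparse_step=1, every layer is MoE unless it is explicitly listed as dense.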

Sources: areal/experimental/models/archon/moe/router.py14-52 areal/experimental/engine/archon_engine.py171-176 areal/experimental/models/archon/qwen3/model/args.py23-25

Diagram: MoE Parallelism Strategies

Sources: areal/experimental/models/archon/moe/router.py109-132

Expert Parallel (EP) Architecture

Expert Parallel shards the expert dimension across GPUs. Each GPU stores and runs only its own subset of experts, which reduces per-GPU memory requirements and makes larger models feasible.

ExpertParallel Implementation

The ExpertParallel class implements EP with an ETP=1 configuration (no tensor parallelism within experts) areal/experimental/models/archon/expert_parallel.py70-79. It shards expert weights on dimension 0 across the EP ranks using distribute_tensor with Shard(0) areal/experimental/models/archon/expert_parallel.py102-104.

Token dispatch is managed via _token_dispatch, which uses all_to_all_single to exchange token counts and all_to_all_single_autograd to route the actual token data areal/experimental/models/archon/expert_parallel.py106-168
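
The two-phase dispatch (exchange counts first, then exchange token data using those counts as split sizes) can be sketched in a single process. This is a simulation under stated assumptions, not the AReaL implementation: the all_to_all helper stands in for the collective, tokens are plain strings, and each rank's tokens are assumed to be pre-sorted by global expert index.

```python
def all_to_all(send):
    """Simulated collective: send[src][dst] becomes recv[dst][src]."""
    n = len(send)
    return [[send[src][dst] for src in range(n)] for dst in range(n)]


def dispatch(tokens_per_rank, counts_per_rank, ep_size):
    """Route tokens to the rank that owns their expert.

    tokens_per_rank[r]: tokens on rank r, sorted by global expert index.
    counts_per_rank[r][e]: number of tokens on rank r routed to expert e.
    """
    num_experts = len(counts_per_rank[0])
    local = num_experts // ep_size  # experts owned by each rank

    # Phase 1: every rank tells every other rank how many tokens to expect
    # for each of the destination's local experts.
    send_counts = [
        [counts[d * local:(d + 1) * local] for d in range(ep_size)]
        for counts in counts_per_rank
    ]
    recv_counts = all_to_all(send_counts)

    # Phase 2: split the sorted tokens by destination rank and exchange them.
    send_tokens = []
    for r in range(ep_size):
        chunks, offset = [], 0
        for d in range(ep_size):
            size = sum(send_counts[r][d])
            chunks.append(tokens_per_rank[r][offset:offset + size])
            offset += size
        send_tokens.append(chunks)
    recv_tokens = all_to_all(send_tokens)
    return recv_counts, recv_tokens


# 4 experts (e0..e3) over 2 ranks: rank 0 owns e0-e1, rank 1 owns e2-e3.
tokens = [["e0_a", "e2_a", "e2_b", "e3_a"], ["e1_a", "e1_b", "e2_c"]]
counts = [[1, 0, 2, 1], [0, 2, 1, 0]]
recv_counts, recv_tokens = dispatch(tokens, counts, ep_size=2)
print(recv_tokens[0])  # tokens for e0-e1, grouped by source rank
print(recv_tokens[1])  # tokens for e2-e3, grouped by source rank
```

Exchanging counts first is what lets each rank size its receive buffer before the token payload arrives.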

Sources: areal/experimental/models/archon/expert_parallel.py70-168

Diagram: EP Process Groups and Communication

Sources: areal/experimental/models/archon/expert_parallel.py29-204 areal/experimental/models/archon/moe/router.py109-132

Expert Tensor Parallel (ETP)

Expert Tensor Parallel applies 2D sharding to expert weights. While EP shards the expert dimension (dimension 0), ETP additionally shards the hidden dimensions of each expert.

ETP Sharding Strategy

Expert weights are stored as 3D tensors in GroupedExperts:

w1, w2, w3: Weight tensors for the expert MLP areal/experimental/models/archon/moe/grouped_experts.py195-203

With ETP enabled:

Dimension 0: Shard across EP group (expert parallel).
Dimension 1/2: Shard across ETP group (tensor parallel within the expert).
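
To make the 2D sharding concrete, the sketch below computes per-GPU shard shapes. The weight layout is an assumption (w1 and w3 shaped [num_experts, hidden_dim, dim], w2 shaped [num_experts, dim, hidden_dim]), as is the choice to shard the hidden_dim axis with ETP; local_shape is a hypothetical helper.

```python
def local_shape(global_shape, shard_sizes):
    """Return the per-GPU shape after sharding.

    shard_sizes maps a tensor dimension to the number of ranks it is split over.
    """
    shape = list(global_shape)
    for dim, size in shard_sizes.items():
        if shape[dim] % size != 0:
            raise ValueError(f"dimension {dim} is not divisible by {size}")
        shape[dim] //= size
    return tuple(shape)


num_experts, dim, hidden_dim = 8, 1024, 4096
ep_size, etp_size = 2, 2

# Assumed layout: w1/w3 are [E, hidden_dim, dim]; w2 is [E, dim, hidden_dim].
w1_local = local_shape((num_experts, hidden_dim, dim), {0: ep_size, 1: etp_size})
w2_local = local_shape((num_experts, dim, hidden_dim), {0: ep_size, 2: etp_size})
print(w1_local)  # dimension 0 split by EP, dimension 1 split by ETP
print(w2_local)  # dimension 0 split by EP, dimension 2 split by ETP
```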

Diagram: ETP Weight Sharding

Sources: areal/experimental/models/archon/moe/grouped_experts.py195-203

MoE Weight Conversion

AReaL provides utilities for converting between HuggingFace MoE weights and distributed engine formats.

Megatron/MCore Conversion

For Megatron-based MoE models (e.g., Bailing MoE), AReaL handles weight slicing and expert distribution:

BailingMoeBridge registers mappings for MoE experts and routers areal/models/mcore/bailing_moe_bridge.py84-131
It supports heterogeneous layers where some layers are dense and others are MoE based on moe_layer_freq areal/models/mcore/bailing_moe_bridge.py138-143
Bailing MoE Support: Specifically handles Lightning Attention and MLA mixed with MoE experts areal/models/mcore/bailing_moe.py9-13

Archon Conversion

The Archon engine uses moe_weight_converter.py to map HuggingFace expert indices to the distributed GroupedExperts format areal/experimental/models/archon/moe_weight_converter.py1-50
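
The core idea of such a conversion is to collect per-expert checkpoint entries and stack them along a new leading expert dimension. The sketch below illustrates only the grouping step with placeholder values instead of tensors. The HuggingFace key pattern follows the Qwen MoE checkpoint layout; the target key names and the gate_proj/down_proj/up_proj to w1/w2/w3 mapping are assumptions, not taken from moe_weight_converter.py.

```python
import re
from collections import defaultdict

# HuggingFace-style per-expert key, e.g.
# "model.layers.0.mlp.experts.3.gate_proj.weight"
_HF_EXPERT_KEY = re.compile(
    r"model\.layers\.(\d+)\.mlp\.experts\.(\d+)\.(gate_proj|up_proj|down_proj)\.weight"
)
# Assumed projection-to-grouped-weight naming.
_PROJ_TO_GROUPED = {"gate_proj": "w1", "down_proj": "w2", "up_proj": "w3"}


def group_expert_weights(hf_state):
    """Group per-expert entries into one ordered list per (layer, weight).

    In real code each list would be stacked into a 3D tensor
    (for example with torch.stack) before being sharded.
    """
    buckets = defaultdict(dict)
    passthrough = {}
    for key, value in hf_state.items():
        match = _HF_EXPERT_KEY.fullmatch(key)
        if match is None:
            passthrough[key] = value  # non-expert weights are left untouched
            continue
        layer, expert, proj = int(match.group(1)), int(match.group(2)), match.group(3)
        buckets[(layer, _PROJ_TO_GROUPED[proj])][expert] = value

    grouped = {}
    for (layer, name), per_expert in buckets.items():
        if sorted(per_expert) != list(range(len(per_expert))):
            raise ValueError(f"layer {layer} {name}: expert indices are not contiguous")
        # Hypothetical target key; ordered by expert index.
        grouped[f"layers.{layer}.moe.experts.{name}"] = [
            per_expert[e] for e in range(len(per_expert))
        ]
    return grouped, passthrough


hf_state = {
    "model.layers.0.mlp.experts.1.gate_proj.weight": "gate_1",
    "model.layers.0.mlp.experts.0.gate_proj.weight": "gate_0",
    "model.layers.0.mlp.experts.0.down_proj.weight": "down_0",
    "model.layers.0.mlp.experts.1.down_proj.weight": "down_1",
    "model.layers.0.mlp.gate.weight": "router",
}
grouped, passthrough = group_expert_weights(hf_state)
print(grouped)
```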

Sources: areal/models/mcore/bailing_moe_bridge.py84-143 areal/models/mcore/bailing_moe.py9-13 areal/experimental/models/archon/moe_weight_converter.py1-50

MoE Module Structure

The MoE class implements the core Mixture of Experts layer with token routing, expert computation, and output combination.

Token Choice Router

The TokenChoiceTopKRouter handles expert selection areal/experimental/models/archon/moe/router.py109-132:

Precision: RouterGatingLinearFunction performs GEMM in router_dtype (typically FP32) while saving activations in the original dtype (e.g., BF16) areal/experimental/models/archon/moe/router.py14-52
Node-Limited Routing: Divides experts into groups and selects a subset of groups (num_limited_groups) before performing top-k selection areal/experimental/models/archon/moe/router.py181-200
Debug Mode: _debug_force_load_balance_routing forces uniform round-robin assignment for testing areal/experimental/models/archon/moe/router.py159-179
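
Node-limited routing can be illustrated for a single token in plain Python. This is a sketch under assumptions: groups are contiguous blocks of experts and are ranked by their best expert score (other implementations rank groups differently, for example by the sum of their top scores). The function name node_limited_topk is hypothetical.

```python
def node_limited_topk(scores, num_groups, num_limited_groups, top_k):
    """Pick top_k experts for one token, restricted to the best groups.

    scores: per-expert router scores for a single token.
    """
    group_size = len(scores) // num_groups
    groups = [scores[g * group_size:(g + 1) * group_size] for g in range(num_groups)]

    # Step 1: keep only the num_limited_groups highest-scoring groups.
    # Assumption: a group's score is the maximum score among its experts.
    ranked = sorted(range(num_groups), key=lambda g: max(groups[g]), reverse=True)
    allowed = set(ranked[:num_limited_groups])

    # Step 2: ordinary top-k, but only over experts in the allowed groups.
    candidates = [e for e in range(len(scores)) if e // group_size in allowed]
    return sorted(candidates, key=lambda e: scores[e], reverse=True)[:top_k]


# 8 experts in 4 groups of 2; only the 2 best groups may be used.
scores = [0.1, 0.9, 0.2, 0.3, 0.8, 0.05, 0.7, 0.6]
print(node_limited_topk(scores, num_groups=4, num_limited_groups=2, top_k=3))
```

Here unrestricted top-3 would pick experts 1, 4, and 6, but expert 6 belongs to a group that was filtered out, so expert 0 is selected instead. Limiting the groups bounds how many devices a token's experts can span.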

GroupedExperts

GroupedExperts stores expert weights as 3D tensors, enabling efficient batched computation areal/experimental/models/archon/moe/grouped_experts.py195-203. It provides the following execution paths:

_run_experts_grouped_mm: Efficient implementation using torch._grouped_mm for BF16 on CUDA areal/experimental/models/archon/moe/grouped_experts.py70-111
_run_experts_for_loop: Fallback implementation for environments without grouped_mm support areal/experimental/models/archon/moe/grouped_experts.py15-67
FP8 Support: Per-expert FP8 computation via _run_experts_fp8_for_loop areal/experimental/models/archon/moe/grouped_experts.py114-181
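
The for-loop strategy can be sketched without any tensor library: tokens arrive sorted by expert, are split using the per-expert token counts, and each chunk is run through its own expert's weights. The SwiGLU form (silu(x @ w1^T) * (x @ w3^T)) @ w2^T and the weight layout are assumptions; the real fallback operates on tensors rather than nested lists.

```python
import math


def matmul_t(x, w):
    """Compute x @ w.T for x of shape [n, in] and w of shape [out, in]."""
    return [[sum(a * b for a, b in zip(row, w_row)) for w_row in w] for row in x]


def silu(v):
    return v / (1.0 + math.exp(-v))


def run_experts_for_loop(w1, w2, w3, x, num_tokens_per_expert):
    """Apply each expert to its own contiguous slice of tokens.

    w1[e], w3[e]: [hidden_dim, dim]; w2[e]: [dim, hidden_dim] (assumed layout).
    x: tokens sorted by expert index, shape [num_tokens, dim].
    """
    out, offset = [], 0
    for e, n in enumerate(num_tokens_per_expert):
        chunk = x[offset:offset + n]
        offset += n
        if n == 0:
            continue  # this expert received no tokens
        gate = matmul_t(chunk, w1[e])
        up = matmul_t(chunk, w3[e])
        hidden = [[silu(g) * u for g, u in zip(g_row, u_row)]
                  for g_row, u_row in zip(gate, up)]
        out.extend(matmul_t(hidden, w2[e]))
    return out


identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
w1 = [identity, identity]
w3 = [identity, identity]
w2 = [zeros, identity]  # expert 0 outputs zeros; expert 1 passes hidden through
x = [[1.0, 2.0], [3.0, 4.0], [1.0, 0.0]]  # two tokens for expert 0, one for expert 1
out = run_experts_for_loop(w1, w2, w3, x, num_tokens_per_expert=[2, 1])
print(out)
```

A grouped matrix multiply computes the same result in one kernel launch instead of one launch per expert, which is why it is preferred when available.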

Sources: areal/experimental/models/archon/moe/router.py109-200 areal/experimental/models/archon/moe/grouped_experts.py15-203

Distributed Checkpointing

MoE models require specialized checkpointing to handle sharded experts.

Megatron Checkpointing: Handles sharded expert weights during model initialization and state dict loading areal/models/mcore/bailing_moe_bridge.py84-131
Archon Async Checkpointing: Supports asynchronous checkpointing via AsyncCheckpointManager to minimize training downtime areal/utils/saver.py160-181. It uses save_model_to_hf to convert sharded Archon weights to the standard HF format areal/utils/saver.py178-180
Recovery: The RecoverHandler and RecoverInfo classes manage state restoration for distributed runs, including dataloader states across different ranks areal/utils/recover.py41-145

Sources: areal/models/mcore/bailing_moe_bridge.py84-131 areal/utils/saver.py160-181 areal/utils/recover.py41-145

URL: https://deepwiki.com/inclusionAI/AReaL/8.7-moe-and-expert-parallelism