
URL: https://deepwiki.com/inclusionAI/AReaL/8.7-moe-and-expert-parallelism

MoE and Expert Parallelism | inclusionAI/AReaL | DeepWiki


Last indexed: 7 May 2026 (2e12c1)

MoE and Expert Parallelism

This page describes MoE (Mixture of Experts) parallelism strategies in AReaL, including Expert Parallel (EP) and Expert Tensor Parallel (ETP). These techniques enable efficient training of large MoE models by distributing expert weights across GPUs.

For information about other parallelism strategies (TP, DP, PP, CP), see 8.1 Parallelism Overview. For the general Archon Engine architecture, see 3.4 ArchonEngine.


Overview

MoE parallelism distributes expert computation across GPUs through two primary mechanisms:

  • Expert Parallel (EP): Shards experts across GPUs along the expert dimension. Each GPU holds a subset of experts (e.g., experts 0-3 on GPU 0, experts 4-7 on GPU 1).
  • Expert Tensor Parallel (ETP): Further shards each expert's weights using tensor parallelism within the expert parallel group.
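To make the EP layout concrete, here is a minimal sketch of how expert ids could map to ranks. It assumes contiguous, evenly sized blocks, which matches the example above (experts 0-3 on GPU 0, experts 4-7 on GPU 1). The helper name experts_on_rank is hypothetical and is not part of AReaL.

```python
def experts_on_rank(num_experts: int, ep_size: int, ep_rank: int) -> list[int]:
    """Return the expert ids held by one EP rank (contiguous, even split assumed)."""
    if num_experts % ep_size != 0:
        raise ValueError("this sketch assumes num_experts is divisible by ep_size")
    per_rank = num_experts // ep_size
    start = ep_rank * per_rank
    return list(range(start, start + per_rank))


# 8 experts over 2 EP ranks, as in the example above.
layout = {rank: experts_on_rank(8, 2, rank) for rank in range(2)}
print(layout)  # {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}
```

With ETP greater than 1, each of these experts would be further split across the ETP group, so a rank would hold only a slice of each expert's weights.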

AReaL supports MoE models primarily through the Archon Engine, which has a custom MoE implementation with EP/ETP configured via ArchonParallelDims (areal/experimental/engine/archon_engine.py171-176). It also supports higher-precision routing via RouterGatingLinearFunction (areal/experimental/models/archon/moe/router.py14-52) and flexible expert sharding.

MoE layers can coexist with dense FFN layers in the same model. In Archon, which layers are MoE is determined by decoder_sparse_step (areal/experimental/models/archon/qwen3/model/args.py23-25).
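As an illustration of how a sparse step can interleave MoE and dense layers, the sketch below uses the convention from the HuggingFace Qwen MoE models, where layer i is sparse when (i + 1) is a multiple of decoder_sparse_step. That Archon applies exactly this rule is an assumption here, and the helper is_moe_layer is hypothetical.

```python
def is_moe_layer(layer_idx: int, decoder_sparse_step: int) -> bool:
    """Assumed rule: every decoder_sparse_step-th layer (1-based) is an MoE layer."""
    return decoder_sparse_step > 0 and (layer_idx + 1) % decoder_sparse_step == 0


# With a step of 2, dense and MoE layers alternate.
print([is_moe_layer(i, 2) for i in range(6)])
# [False, True, False, True, False, True]

# With a step of 1, every layer is an MoE layer.
print(all(is_moe_layer(i, 1) for i in range(6)))  # True
```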

Sources: areal/experimental/models/archon/moe/router.py14-52 areal/experimental/engine/archon_engine.py171-176 areal/experimental/models/archon/qwen3/model/args.py23-25

Diagram: MoE Parallelism Strategies


Sources: areal/experimental/models/archon/moe/router.py109-132


Expert Parallel (EP) Architecture

Expert Parallel shards the expert dimension across GPUs. Each GPU stores and computes only its own subset of experts, which reduces per-GPU memory and lets larger models fit.

ExpertParallel Implementation

The ExpertParallel class implements EP with an ETP=1 configuration, i.e. no tensor parallelism within experts (areal/experimental/models/archon/expert_parallel.py70-79). It shards expert weights on dimension 0 across the EP ranks using distribute_tensor with Shard(0) (areal/experimental/models/archon/expert_parallel.py102-104).

Token dispatch is managed via _token_dispatch, which uses all_to_all_single to exchange token counts and all_to_all_single_autograd to route the actual token data (areal/experimental/models/archon/expert_parallel.py106-168).
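The two-phase pattern (exchange counts first, then payloads) can be sketched without any GPUs by simulating the ranks in one process. This is a simplified illustration, not the code in _token_dispatch: it assumes top-1 routing and contiguous expert ownership, and the helpers all_to_all and dispatch are hypothetical stand-ins for the collective calls.

```python
def all_to_all(send: list[list]) -> list[list]:
    """Simulated all-to-all: send[src][dst] arrives as recv[dst][src]."""
    n = len(send)
    return [[send[src][dst] for src in range(n)] for dst in range(n)]


def dispatch(tokens_per_rank, expert_ids_per_rank, num_experts, ep_size):
    experts_per_rank = num_experts // ep_size
    # Bucket each source rank's tokens by the EP rank that owns the chosen expert.
    buckets = [[[] for _ in range(ep_size)] for _ in range(ep_size)]
    for src in range(ep_size):
        for tok, eid in zip(tokens_per_rank[src], expert_ids_per_rank[src]):
            buckets[src][eid // experts_per_rank].append((eid, tok))
    # Phase 1: exchange counts so every rank knows how much it will receive.
    send_counts = [[len(b) for b in row] for row in buckets]
    recv_counts = all_to_all(send_counts)
    # Phase 2: exchange the token payloads themselves.
    received = all_to_all(buckets)
    return recv_counts, received


# 4 experts over 2 ranks: rank 0 owns experts 0-1, rank 1 owns experts 2-3.
recv_counts, received = dispatch(
    tokens_per_rank=[["a", "b", "c"], ["d"]],
    expert_ids_per_rank=[[0, 3, 2], [1]],
    num_experts=4,
    ep_size=2,
)
print(recv_counts)  # [[1, 1], [2, 0]]
print(received[1])  # [[(3, 'b'), (2, 'c')], []]
```

The counts are exchanged first because the receive buffer sizes for the second collective depend on them. The return path after expert computation would reverse the same exchange.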

Sources: areal/experimental/models/archon/expert_parallel.py70-168

Diagram: EP Process Groups and Communication


Sources: areal/experimental/models/archon/expert_parallel.py29-204 areal/experimental/models/archon/moe/router.py109-132


Expert Tensor Parallel (ETP)

Expert Tensor Parallel applies 2D sharding to expert weights. While EP shards the expert dimension (dimension 0), ETP additionally shards the hidden dimensions of each expert.

ETP Sharding Strategy

Expert weights are stored as 3D tensors in GroupedExperts:

With ETP enabled:

  • Dimension 0: Shard across EP group (expert parallel).
  • Dimension 1/2: Shard across ETP group (tensor parallel within the expert).
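The effect of this 2D sharding on per-rank memory can be worked out from shapes alone. The sketch below computes the local shard shape of one 3D expert weight. The global shape (num_experts, 1536, 4096) and the choice of which of dimensions 1 or 2 is split for a given weight are illustrative assumptions, and local_shard_shape is a hypothetical helper.

```python
def local_shard_shape(global_shape, ep_size, etp_size, etp_dim):
    """Shape of one rank's shard: dim 0 split by EP, dim 1 or 2 split by ETP."""
    if etp_dim not in (1, 2):
        raise ValueError("ETP shards dimension 1 or 2 of the 3D expert weight")
    shape = list(global_shape)
    for dim, parts in ((0, ep_size), (etp_dim, etp_size)):
        if shape[dim] % parts != 0:
            raise ValueError(f"dimension {dim} is not divisible by {parts}")
        shape[dim] //= parts
    return tuple(shape)


# 8 experts, EP=4, ETP=2: each rank holds 2 experts and half of one inner dimension.
print(local_shard_shape((8, 1536, 4096), ep_size=4, etp_size=2, etp_dim=1))
# (2, 768, 4096)
print(local_shard_shape((8, 1536, 4096), ep_size=4, etp_size=2, etp_dim=2))
# (2, 1536, 2048)
```

In both cases each rank stores 1/(EP x ETP) of the weight, here one eighth.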

Diagram: ETP Weight Sharding


Sources: areal/experimental/models/archon/moe/grouped_experts.py195-203


MoE Weight Conversion

AReaL provides utilities for converting between HuggingFace MoE weights and distributed engine formats.

Megatron/MCore Conversion

For Megatron-based MoE models (e.g., Bailing MoE), AReaL handles weight slicing and expert distribution:

Archon Conversion

The Archon engine uses moe_weight_converter.py to map HuggingFace expert indices to the distributed GroupedExperts format (areal/experimental/models/archon/moe_weight_converter.py1-50).
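The core of such a conversion is collecting per-expert entries and ordering them by expert index so they can be stacked into one grouped weight. The sketch below shows that step on a plain dictionary. It assumes HuggingFace-style keys of the form "<prefix>.experts.<idx>.<proj>.weight". The output key names and the function group_expert_weights are hypothetical and do not reflect the actual names used in moe_weight_converter.py.

```python
import re
from collections import defaultdict

# Assumed HuggingFace-style key layout for per-expert weights.
EXPERT_KEY = re.compile(r"^(?P<prefix>.+\.experts)\.(?P<idx>\d+)\.(?P<proj>\w+)\.weight$")


def group_expert_weights(state_dict: dict) -> dict:
    """Collect per-expert entries into lists ordered by expert index."""
    grouped = defaultdict(dict)
    out = {}
    for key, value in state_dict.items():
        m = EXPERT_KEY.match(key)
        if m is None:
            out[key] = value  # non-expert weights pass through unchanged
            continue
        grouped[(m["prefix"], m["proj"])][int(m["idx"])] = value
    for (prefix, proj), by_idx in grouped.items():
        expected = list(range(len(by_idx)))
        if sorted(by_idx) != expected:
            raise ValueError(f"missing expert indices under {prefix}.*.{proj}")
        # A real converter would stack these into a single 3D tensor.
        out[f"{prefix}.{proj}"] = [by_idx[i] for i in expected]
    return out


hf_state = {
    "model.layers.0.mlp.experts.1.gate_proj.weight": "g1",
    "model.layers.0.mlp.experts.0.gate_proj.weight": "g0",
    "model.layers.0.mlp.gate.weight": "router",
}
converted = group_expert_weights(hf_state)
print(converted["model.layers.0.mlp.experts.gate_proj"])  # ['g0', 'g1']
```

Strings stand in for tensors here so the example runs without any dependencies.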

Sources: areal/models/mcore/bailing_moe_bridge.py84-143 areal/models/mcore/bailing_moe.py9-13 areal/experimental/models/archon/moe_weight_converter.py1-50


MoE Module Structure

The MoE class implements the core Mixture of Experts layer with token routing, expert computation, and output combination.

Token Choice Router

The TokenChoiceTopKRouter handles expert selection (areal/experimental/models/archon/moe/router.py109-132):
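A minimal sketch of token-choice top-k routing for a single token, in plain Python. It assumes softmax scoring followed by renormalization over the selected experts. The actual score function and normalization options of TokenChoiceTopKRouter are not shown on this page, so treat both as assumptions; route_token is a hypothetical helper.

```python
import math


def route_token(logits: list[float], top_k: int) -> list[tuple[int, float]]:
    """Pick the top_k experts for one token and renormalize their weights."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


routes = route_token([0.0, 2.0, 1.0, -1.0], top_k=2)
print([expert for expert, _ in routes])  # [1, 2]
print(round(sum(weight for _, weight in routes), 6))  # 1.0
```

In "token choice" routing each token picks its own experts, so the number of tokens per expert can be uneven. This is why the dispatch step has to exchange per-expert token counts before moving data.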

GroupedExperts

GroupedExperts stores expert weights as 3D tensors, enabling efficient batch computation (areal/experimental/models/archon/moe/grouped_experts.py195-203). It supports:

Sources: areal/experimental/models/archon/moe/router.py109-200 areal/experimental/models/archon/moe/grouped_experts.py15-203


Distributed Checkpointing

MoE models require specialized checkpointing to handle sharded experts.

Sources: areal/models/mcore/bailing_moe_bridge.py84-131 areal/utils/saver.py160-181 areal/utils/recover.py41-145