VOOZH about

URL: https://deepwiki.com/inclusionAI/AReaL/8-parallelism-and-distribution

⇱ Parallelism and Distribution | inclusionAI/AReaL | DeepWiki


Loading...
Last indexed: 7 May 2026 (2e12c1)
Menu

Parallelism and Distribution

This page provides a technical reference for AReaL's multi-dimensional parallelism system, which enables distributed training across multiple GPUs and nodes. It covers the five parallelism dimensions (data, tensor, context, pipeline, expert), their backend-specific implementations, device mesh architecture, and configuration constraints.

For information about specific training backends, see Training Engines. For distributed job scheduling and worker management, see Scheduler and Job Management.

Parallelism Dimensions Overview

AReaL supports five orthogonal parallelism dimensions that can be combined to scale training across thousands of GPUs:

DimensionAbbreviationDescriptionUse Case
Data ParallelDPDifferent workers process different batchesScale throughput with more GPUs
Tensor ParallelTPModel parameters split column/row-wise across devicesModels too large for single GPU memory
Context ParallelCPSequence length split across devices (Ulysses)Long context training (>32k tokens)
Pipeline ParallelPPModel layers split into stages across devicesVery deep models, reduce memory per GPU
Expert ParallelEPMixture-of-Experts experts distributed across devicesMoE models (e.g., Qwen-MoE)

Multi-Dimensional Parallelism Topology

The following diagram illustrates how these dimensions are nested within a global process group to form a training cluster, associating high-level concepts with specific code entities.


Sources: areal/engine/fsdp_engine.py203-214 areal/engine/megatron_engine.py18-27 areal/experimental/engine/archon_engine.py173-185 areal/engine/megatron_engine.py22-29

Parallelism Configuration

ParallelStrategy and ArchonParallelDims

The ParallelStrategy specifies world sizes for each dimension. In the ArchonEngine, this is specialized into ArchonParallelDims which manages the complex interactions between EP and TP. For details, see Parallelism Overview.

DimensionArchonParallelDims AttributeValidation Rule
Data Paralleldp_shardworld_size // (tp * cp * pp)
TensortpMust divide head count
ContextcpUlysses SP size
PipelineppNumber of model stages
ExpertepDivisible by cp or cp * tp

Sources: areal/experimental/models/archon/parallel_dims.py27-42 areal/experimental/models/archon/parallel_dims.py126-141

allocation_mode String Syntax

The allocation_mode provides a compact syntax for specifying backend and parallelism in a single string. It is parsed to determine resource distribution between inference and training.

# Syntax pattern:
"rollout_backend:gpu_spec+train_backend:gpu_spec"

# Examples:
"sglang:d8+fsdp:d4t2" # 8 GPUs for SGLang, 8 GPUs FSDP (4 DP × 2 TP)
"vllm:d16t2+megatron:d2t4p2" # 32 GPUs vLLM, 16 GPUs Megatron (2×4×2)

Sources: areal/api/cli_args.py210-215 docs/en/cli_reference.md104

Backend-Specific Implementations

FSDP: 3D Parallelism (DP × SP × TP)

FSDPEngine uses PyTorch's FSDP2 with a 3-dimensional device mesh. It leverages ParallelHelper to manage the DeviceMesh hierarchy. For details, see FSDP Parallelism.

Key Classes:

Sources: areal/engine/fsdp_engine.py203-214 areal/engine/fsdp_utils/parallel.py86

Megatron: 5D Parallelism

MegatronEngine leverages Megatron-LM's mpu (model parallel utility) for 5D topology. It uses AutoBridge for weight conversion between HuggingFace and Megatron formats.

Sources: areal/engine/megatron_engine.py18-27 areal/engine/megatron_engine.py180-185

Archon: Native PyTorch 5D Parallelism

ArchonEngine is a custom backend implementing TP, CP, EP, DP, and PP using torch.distributed.tensor (DTensor) and torch.distributed.pipelining. For details, see Archon Parallelism.


Key Components:

Sources: areal/experimental/engine/archon_engine.py18-22 areal/experimental/models/archon/qwen3/infra/parallelize.py86-172 areal/experimental/engine/archon_runner.py120-135

Device Mesh and Process Groups

Explicit Process Group Management

AReaL avoids using global process groups for collectives to enable multi-engine co-location (e.g., running Actor and Critic on the same nodes but different groups). For details, see Device Mesh and Process Groups.

CPU Group for Synchronization: All engines create a separate Gloo-based CPU group for metadata exchange and recovery coordination.


Sources: areal/engine/fsdp_engine.py228-230 areal/experimental/engine/archon_engine.py211-213

Constraint Validation

Parallelism Constraints

The system validates that model architectures are compatible with requested sharding. For details, see Parallelism Constraint Validation.

Sources: areal/experimental/models/archon/parallel_dims.py138-164 areal/experimental/engine/archon_engine.py161-165

Context Parallelism (Ulysses)

AReaL implements Ulysses sequence parallelism. Unlike Ring Attention, Ulysses keeps the full sequence on each GPU during attention but shards across the head dimension using All-to-All communication. For details, see Context Parallel (Ulysses).

Key Operations:

Sources: areal/models/fsdp/ulysses.py89-94 areal/experimental/models/archon/ulysses.py79-82 areal/experimental/models/archon/qwen3/model/model.py103-110

MoE and Expert Parallelism

For Mixture-of-Experts, AReaL supports distributing experts across devices using Expert Parallelism (EP). For details, see MoE and Expert Parallelism.

StrategyEPTPExpert Weight Sharding
TensorParallel1>1[Shard(1/2)]
ExpertParallel>11[Shard(0)]
ExpertTensorParallel>1>1[Shard(0), Shard(1/2)]

Sources: areal/experimental/models/archon/parallel_dims.py51-57 areal/engine/megatron_engine.py128-132 areal/experimental/models/archon/moe/moe.py40-61