Last indexed: 7 May 2026 (2e12c1)

Parallelism and Distribution

This page provides a technical reference for AReaL's multi-dimensional parallelism system, which enables distributed training across multiple GPUs and nodes. It covers the five parallelism dimensions (data, tensor, context, pipeline, expert), their backend-specific implementations, device mesh architecture, and configuration constraints.

For information about specific training backends, see Training Engines. For distributed job scheduling and worker management, see Scheduler and Job Management.

Parallelism Dimensions Overview

AReaL supports five orthogonal parallelism dimensions that can be combined to scale training across thousands of GPUs:

Dimension	Abbreviation	Description	Use Case
Data Parallel	DP	Different workers process different batches	Scale throughput with more GPUs
Tensor Parallel	TP	Model parameters split column/row-wise across devices	Models too large for single GPU memory
Context Parallel	CP	Sequence length split across devices (Ulysses)	Long context training (>32k tokens)
Pipeline Parallel	PP	Model layers split into stages across devices	Very deep models, reduce memory per GPU
Expert Parallel	EP	Mixture-of-Experts experts distributed across devices	MoE models (e.g., Qwen-MoE)

Multi-Dimensional Parallelism Topology

The following diagram illustrates how these dimensions are nested within a global process group to form a training cluster, associating high-level concepts with specific code entities.

Sources: areal/engine/fsdp_engine.py203-214 areal/engine/megatron_engine.py18-27 areal/experimental/engine/archon_engine.py173-185 areal/engine/megatron_engine.py22-29

Parallelism Configuration

ParallelStrategy and ArchonParallelDims

The ParallelStrategy specifies world sizes for each dimension. In the ArchonEngine, this is specialized into ArchonParallelDims which manages the complex interactions between EP and TP. For details, see Parallelism Overview.

Dimension	`ArchonParallelDims` Attribute	Validation Rule
Data Parallel	`dp_shard`	`world_size // (tp * cp * pp)`
Tensor	`tp`	Must divide head count
Context	`cp`	Ulysses SP size
Pipeline	`pp`	Number of model stages
Expert	`ep`	Divisible by `cp` or `cp * tp`

Sources: areal/experimental/models/archon/parallel_dims.py27-42 areal/experimental/models/archon/parallel_dims.py126-141

allocation_mode String Syntax

The allocation_mode provides a compact syntax for specifying backend and parallelism in a single string. It is parsed to determine resource distribution between inference and training.

# Syntax pattern:
"rollout_backend:gpu_spec+train_backend:gpu_spec"

# Examples:
"sglang:d8+fsdp:d4t2" # 8 GPUs for SGLang, 8 GPUs FSDP (4 DP × 2 TP)
"vllm:d16t2+megatron:d2t4p2" # 32 GPUs vLLM, 16 GPUs Megatron (2×4×2)

Sources: areal/api/cli_args.py210-215 docs/en/cli_reference.md104

Backend-Specific Implementations

FSDP: 3D Parallelism (DP × SP × TP)

FSDPEngine uses PyTorch's FSDP2 with a 3-dimensional device mesh. It leverages ParallelHelper to manage the DeviceMesh hierarchy. For details, see FSDP Parallelism.

Key Classes:

ParallelHelper: Manages mesh creation and group extraction areal/engine/fsdp_utils/parallel.py86
FSDPTrainContext: Tracks Ulysses padding and sequence slicing through the pipeline areal/engine/fsdp_engine.py141-162

Sources: areal/engine/fsdp_engine.py203-214 areal/engine/fsdp_utils/parallel.py86

Megatron: 5D Parallelism

MegatronEngine leverages Megatron-LM's mpu (model parallel utility) for 5D topology. It uses AutoBridge for weight conversion between HuggingFace and Megatron formats.

Sources: areal/engine/megatron_engine.py18-27 areal/engine/megatron_engine.py180-185

Archon: Native PyTorch 5D Parallelism

ArchonEngine is a custom backend implementing TP, CP, EP, DP, and PP using torch.distributed.tensor (DTensor) and torch.distributed.pipelining. For details, see Archon Parallelism.

Key Components:

create_runner: Orchestrates the application of all parallel layers to the model areal/experimental/engine/archon_runner.py120-135
ForwardBackwardRunner: Executes the pipeline schedule (e.g., ScheduleZBVZeroBubble) areal/experimental/engine/archon_engine.py173

Sources: areal/experimental/engine/archon_engine.py18-22 areal/experimental/models/archon/qwen3/infra/parallelize.py86-172 areal/experimental/engine/archon_runner.py120-135

Device Mesh and Process Groups

Explicit Process Group Management

AReaL avoids using global process groups for collectives to enable multi-engine co-location (e.g., running Actor and Critic on the same nodes but different groups). For details, see Device Mesh and Process Groups.

CPU Group for Synchronization: All engines create a separate Gloo-based CPU group for metadata exchange and recovery coordination.

Sources: areal/engine/fsdp_engine.py228-230 areal/experimental/engine/archon_engine.py211-213

Constraint Validation

Parallelism Constraints

The system validates that model architectures are compatible with requested sharding. For details, see Parallelism Constraint Validation.

TP Constraints: Validates that head counts are divisible by TP size areal/experimental/engine/archon_engine.py161-165
EP Constraints: Validates expert counts and mesh compatibility areal/experimental/models/archon/parallel_dims.py144-156

Sources: areal/experimental/models/archon/parallel_dims.py138-164 areal/experimental/engine/archon_engine.py161-165

Context Parallelism (Ulysses)

AReaL implements Ulysses sequence parallelism. Unlike Ring Attention, Ulysses keeps the full sequence on each GPU during attention but shards across the head dimension using All-to-All communication. For details, see Context Parallel (Ulysses).

Key Operations:

ulysses_slice_inputs: Slices inputs for sequence parallelism areal/experimental/models/archon/ulysses.py81-82
ulysses_gather_output: All-to-All to restore sequence sharding after attention areal/experimental/models/archon/ulysses.py80

Sources: areal/models/fsdp/ulysses.py89-94 areal/experimental/models/archon/ulysses.py79-82 areal/experimental/models/archon/qwen3/model/model.py103-110

MoE and Expert Parallelism

For Mixture-of-Experts, AReaL supports distributing experts across devices using Expert Parallelism (EP). For details, see MoE and Expert Parallelism.

Strategy	EP	TP	Expert Weight Sharding
TensorParallel	1	>1	`[Shard(1/2)]`
ExpertParallel	>1	1	`[Shard(0)]`
ExpertTensorParallel	>1	>1	`[Shard(0), Shard(1/2)]`

Sources: areal/experimental/models/archon/parallel_dims.py51-57 areal/engine/megatron_engine.py128-132 areal/experimental/models/archon/moe/moe.py40-61

Refresh this wiki

URL: https://deepwiki.com/inclusionAI/AReaL/8-parallelism-and-distribution