Last indexed: 7 May 2026 (2e12c1)

Parallelism Overview

Purpose and Scope

This page introduces the multi-dimensional parallelism strategies used in AReaL to scale training and inference across multiple GPUs. It covers the five core parallelism dimensions (Data, Tensor, Pipeline, Context, Expert), how they compose to form a distributed device mesh, and how to specify them via the configuration system.

AReaL supports three primary training backends—FSDP, Megatron, and Archon—each offering different levels of support for these dimensions.

Sources: areal/api/cli_args.py:39-136, areal/engine/fsdp_engine.py:218-221, areal/experimental/engine/archon_engine.py:147-194, areal/engine/megatron_engine.py:168-186

Why Parallelism

Training large language models requires distributing computation across multiple GPUs for three reasons:

Memory constraints: Model parameters, activations, and optimizer states can exceed the memory of a single GPU.
Compute throughput: Single-GPU training is too slow for practical iteration on billion-parameter models.
Batch size scaling: Data Parallelism allows larger effective batch sizes, which can improve training stability and convergence.

AReaL supports five orthogonal parallelism dimensions that can be combined to efficiently utilize clusters ranging from single nodes (8 GPUs) to hundreds of GPUs.

Sources: areal/api/cli_args.py:100-136, areal/engine/fsdp_engine.py:218-221, areal/utils/data.py:116-127

Parallelism Dimensions

The table below summarizes each dimension, its abbreviation, its effect on memory, and its communication pattern:

Dimension	Abbreviation	Description	Memory Impact	Communication Pattern
Data Parallel (DP)	`d`	Replicate the model across devices and split the batch.	Little to none (params/grads are replicated).	All-reduce of gradients.
Tensor Parallel (TP)	`t`	Split individual layers/tensors horizontally.	Reduces param/activation memory.	All-reduce/all-gather per layer.
Pipeline Parallel (PP)	`p`	Split model layers vertically into stages.	Reduces param memory per stage.	P2P (send/recv) between stages.
Context Parallel (CP)	`c`	Split the sequence dimension (Ulysses).	Reduces activation memory.	All-to-all in attention.
Expert Parallel (EP)	`e`	Split MoE experts across devices.	Reduces expert param memory.	All-to-all for token routing.
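As a rough illustration of the Memory Impact column, the sketch below estimates how many parameters one GPU holds under TP and PP. It assumes parameters divide evenly across `tp` and `pp` and that plain DP replicates them (no FSDP-style sharding). The function names and the 8e9-parameter model are hypothetical and not taken from AReaL.

```python
def params_per_gpu(total_params: float, tp: int = 1, pp: int = 1) -> float:
    """Rough parameter count held by one GPU.

    Assumption: parameters split evenly across tp and pp, and plain DP
    replicates them. Real layouts (embeddings, uneven stages) will differ.
    """
    if tp < 1 or pp < 1:
        raise ValueError("tp and pp must be >= 1")
    return total_params / (tp * pp)


def param_bytes(num_params: float, bytes_per_param: int = 2) -> float:
    """Bytes for the weights alone (2 bytes per parameter, e.g. bf16)."""
    return num_params * bytes_per_param


# Hypothetical 8e9-parameter model under d2t4 (dp=2, tp=4, pp=1).
per_gpu = params_per_gpu(8e9, tp=4, pp=1)
weights_gib = param_bytes(per_gpu) / 2**30
print(f"{per_gpu:.2e} params per GPU, about {weights_gib:.2f} GiB of weights")
```

The estimate covers weights only. Gradients, optimizer states, and activations add to it, which is why the sharded and context-parallel options in the table matter in practice.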

Implementation Details

DP: Increases throughput by processing more samples in parallel. FSDPEngine uses fully_shard for memory-efficient data parallelism (areal/engine/fsdp_engine.py:86).
TP: Shards the parameters of individual layers across devices. ArchonEngine uses DTensor for this horizontal sharding (areal/experimental/engine/archon_engine.py:179).
PP: Fits models that exceed a single GPU's memory by splitting their layers into stages. ArchonEngine uses torch.distributed.pipelining schedules such as ScheduleZBVZeroBubble (areal/experimental/engine/archon_engine.py:18-22).
CP: Enables training on long sequences (e.g., 32k+ tokens) by sharding the sequence dimension with the Ulysses algorithm. FSDP implements this via ulysses_pad_and_slice_inputs (areal/engine/fsdp_engine.py:92), and Archon via ulysses_slice_inputs (areal/experimental/models/archon/ulysses.py:81-82).
EP: Scales MoE models by distributing experts across devices. MegatronEngine supports EP via Megatron-Core utilities (areal/engine/megatron_engine.py:22-29).
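To make the CP item concrete, here is a minimal sketch of the pad-and-slice step that precedes Ulysses attention. It is not the code of ulysses_pad_and_slice_inputs or ulysses_slice_inputs. The function name `pad_and_slice`, the `pad_id` parameter, and the choice of contiguous slices are assumptions made for illustration.

```python
from typing import List


def pad_and_slice(tokens: List[int], cp_size: int, cp_rank: int, pad_id: int = 0) -> List[int]:
    """Pad `tokens` to a multiple of cp_size, then return this rank's contiguous slice."""
    if not 0 <= cp_rank < cp_size:
        raise ValueError("cp_rank must be in [0, cp_size)")
    remainder = len(tokens) % cp_size
    # Append only as much padding as is needed to make the length divisible.
    padded = tokens + [pad_id] * ((cp_size - remainder) % cp_size)
    chunk = len(padded) // cp_size
    return padded[cp_rank * chunk:(cp_rank + 1) * chunk]


tokens = list(range(1, 11))  # 10 tokens, not divisible by cp_size=4
shards = [pad_and_slice(tokens, cp_size=4, cp_rank=r) for r in range(4)]
print(shards)
```

Each rank then holds one quarter of the padded sequence, so activation memory per GPU shrinks accordingly. The all-to-all inside attention is what lets every rank still attend over the full sequence.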

Sources: areal/api/cli_args.py:100-136, areal/experimental/engine/archon_engine.py:147-194, areal/engine/megatron_engine.py:168-186, areal/engine/fsdp_engine.py:89-94

World Size Calculation

The total number of GPUs required (world size) is computed as:

world_size = dp × tp × pp × cp

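Note on Expert Parallel (ep): ep does not appear as a factor in this product. In AReaL's ArchonEngine, Expert Parallelism uses dimension borrowing: it reuses ranks already allocated to other dimensions instead of adding GPUs of its own, so the ep size must be compatible with the mesh formed by dp_shard, cp, and tp.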
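The product above can be written as a small helper. This is a sketch for checking a planned layout against the number of available GPUs; the function name `world_size` is illustrative and is not an AReaL API.

```python
def world_size(dp: int = 1, tp: int = 1, pp: int = 1, cp: int = 1) -> int:
    """GPUs required: the product of the four mesh dimensions (ep is not a factor)."""
    for name, size in {"dp": dp, "tp": tp, "pp": pp, "cp": cp}.items():
        if size < 1:
            raise ValueError(f"{name} must be >= 1, got {size}")
    return dp * tp * pp * cp


# A layout with dp=2, tp=4, pp=2 needs two 8-GPU nodes.
print(world_size(dp=2, tp=4, pp=2))
```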

Configuration Examples

Parallelism Config	dp	tp	pp	cp	World Size	Notes
`d8`	8	1	1	1	8	Standard Data Parallel.
`d2t4`	2	4	1	1	8	DP + TP (common for 70B models).
`d2p2t4`	2	4	2	1	16	DP + PP + TP.
`d4t2c2`	4	2	1	2	16	DP + TP + CP (long context).
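The rows of this table can be reproduced with a short parser for the compact dimension strings. This is a simplified sketch, not AReaL's parser. `parse_dims` and `ABBREV` are hypothetical names, and `e` is left out because ep is not a factor in the world size.

```python
import re
from math import prod
from typing import Dict

# Letter -> dimension name, following the abbreviations used on this page.
ABBREV = {"d": "dp", "t": "tp", "p": "pp", "c": "cp"}


def parse_dims(spec: str) -> Dict[str, int]:
    """Parse a compact string such as 'd2p2t4' into per-dimension sizes."""
    sizes = {name: 1 for name in ABBREV.values()}  # unspecified dimensions default to 1
    matches = re.findall(r"([a-z])(\d+)", spec)
    # Reject input with characters that are not part of a letter+number pair.
    if not matches or "".join(letter + num for letter, num in matches) != spec:
        raise ValueError(f"malformed spec: {spec!r}")
    for letter, num in matches:
        if letter not in ABBREV:
            raise ValueError(f"unsupported dimension {letter!r} in {spec!r}")
        sizes[ABBREV[letter]] = int(num)
    return sizes


for spec in ("d8", "d2t4", "d2p2t4", "d4t2c2"):
    sizes = parse_dims(spec)
    print(spec, sizes, "world_size =", prod(sizes.values()))
```

Note that the order of letters in the string does not affect the result here: `d2p2t4` and `d2t4p2` produce the same sizes in this sketch.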

Sources: areal/api/cli_args.py:100-136, areal/engine/megatron_engine.py:180-209

Device Mesh Structure

AReaL organizes GPUs into a multi-dimensional device mesh. The implementation varies by engine:

FSDPEngine

Uses a 3D mesh: [dp, sp, tp], where sp is sequence (context) parallelism.

Initialization: ParallelHelper in areal/engine/fsdp_utils/parallel.py (see areal/engine/fsdp_engine.py:86)
Mesh Storage: self.world_mesh in FSDPEngine (areal/engine/fsdp_engine.py:209)

MegatronEngine

Uses Megatron-LM's internal mpu (model parallel utilities).

Initialization: mpu.initialize_model_parallel() is called within create_process_group (areal/engine/megatron_engine.py:22)
Groups: Uses megatron.core.parallel_state to manage the tp, pp, and dp groups (areal/engine/megatron_engine.py:22-29)

ArchonEngine

Uses a 4D DeviceMesh: [dp, pp, tp, cp].

Mesh Management: self._world_mesh stores the distributed topology (areal/experimental/engine/archon_engine.py:179)

Diagram: Logical GPU grouping for d2p2t2
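The d2p2t2 grouping can also be worked out in plain Python. The sketch below assumes ranks are laid out row-major over [dp, pp, tp], so tp varies fastest. The actual rank ordering in each engine may differ, and `build_groups` is a hypothetical helper, not part of AReaL or PyTorch.

```python
from itertools import product
from typing import Dict, List


def build_groups(shape: Dict[str, int]) -> Dict[str, List[List[int]]]:
    """Lay ranks out row-major over `shape` and list the rank groups along each axis."""
    names = list(shape)
    sizes = [shape[n] for n in names]
    # Row-major strides: the last axis varies fastest.
    strides = [1] * len(sizes)
    for i in range(len(sizes) - 2, -1, -1):
        strides[i] = strides[i + 1] * sizes[i + 1]

    groups: Dict[str, List[List[int]]] = {}
    for axis, name in enumerate(names):
        others = [range(s) for i, s in enumerate(sizes) if i != axis]
        axis_groups = []
        # Fix every other coordinate, then walk along `axis` to collect one group.
        for fixed in product(*others):
            coords = list(fixed)
            group = []
            for k in range(sizes[axis]):
                full = coords[:axis] + [k] + coords[axis:]
                group.append(sum(c * st for c, st in zip(full, strides)))
            axis_groups.append(group)
        groups[name] = axis_groups
    return groups


groups = build_groups({"dp": 2, "pp": 2, "tp": 2})
for name, axis_groups in groups.items():
    print(name, axis_groups)
```

Under this layout, ranks in the same tp group are adjacent, which is the usual reason to place tp innermost: its per-layer collectives benefit most from the fastest interconnect.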

Sources: areal/engine/fsdp_engine.py:209-215, areal/engine/megatron_engine.py:180-209, areal/experimental/engine/archon_engine.py:175-187

Specifying Parallelism: allocation_mode

The allocation_mode string is the primary way to configure distribution. It defines the backend and the size of each dimension (areal/api/cli_args.py:102).

Syntax Overview

[backend]:[dimensions]

Multiple pools (e.g., separate inference and training) are separated by +.

Examples

"d8": Auto-selects FSDP with 8-way Data Parallel.
"fsdp:d4t2": Explicit FSDP with 4-way DP and 2-way TP.
"megatron:d2p2t4": Megatron with 2-way DP, 2-way PP, and 4-way TP.
"archon:d2p2t2c2": Archon with 4D parallelism (2-way DP, PP, TP, and CP).

The string is parsed into a ParallelStrategy object, which is then passed to the engine during initialization (areal/api/cli_args.py:48-51).
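The outer structure of the syntax (pools joined by `+`, each with an optional backend prefix) can be sketched as follows. This is a deliberately simplified illustration, not AReaL's parser. `Pool` and `split_allocation_mode` are hypothetical names, and `Pool` is not the ParallelStrategy class.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Pool:
    """Hypothetical container for one pool; not AReaL's ParallelStrategy."""
    backend: Optional[str]  # None means "no prefix given, let the framework choose"
    dims: str               # compact dimension string, e.g. "d2p2t4"


def split_allocation_mode(mode: str) -> List[Pool]:
    """Split '[backend]:[dimensions]' pools joined by '+'."""
    pools = []
    for part in mode.split("+"):
        part = part.strip()
        if not part:
            raise ValueError(f"empty pool in {mode!r}")
        backend, sep, dims = part.partition(":")
        if not sep:  # no backend prefix, e.g. "d8"
            backend, dims = "", part
        if not dims:
            raise ValueError(f"missing dimensions in {part!r}")
        pools.append(Pool(backend=backend or None, dims=dims))
    return pools


print(split_allocation_mode("d8"))
print(split_allocation_mode("megatron:d2p2t4"))
# Two pools; this particular combination is only an illustration.
print(split_allocation_mode("fsdp:d4t2+megatron:d2p2t4"))
```

Keeping the backend as `None` when no prefix is given leaves the auto-selection decision (FSDP in the `"d8"` example) to a later step instead of hard-coding it in the splitter.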

Sources: areal/api/cli_args.py:48-51, areal/engine/megatron_engine.py:180-185

Configuration to Code Entity Mapping

The following diagram maps the allocation_mode configuration to the specific classes and functions that implement each parallelism dimension.

Diagram: Mapping configuration to engine implementation

Sources: areal/api/cli_args.py:48-55, areal/engine/fsdp_engine.py:218-221, areal/engine/megatron_engine.py:168-175, areal/experimental/engine/archon_engine.py:150-187, areal/experimental/engine/archon_runner.py:56

Backend Support Matrix

Feature	FSDPEngine	MegatronEngine	ArchonEngine
Data Parallel	Fully Sharded (FSDP2)	DistributedDataParallel (DDP)	Fully Sharded (FSDP2)
Tensor Parallel	DTensor-based	Megatron-Core TP	DTensor-based
Pipeline Parallel	✗	Megatron-Core PP	`PipelineStage` (native)
Context Parallel	Ulysses	Packed CP	Ulysses
Expert Parallel	✗	Megatron-Core EP	`ExpertParallel` (native)
Offloading	`CPUOffloadPolicy`	`torch_memory_saver`	`torch_memory_saver`
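The crosses in this matrix translate into a simple pre-flight check. The sketch below encodes only what the table states (FSDPEngine lacks Pipeline and Expert Parallel). The names `UNSUPPORTED` and `check_supported` are hypothetical and do not correspond to an AReaL function.

```python
from typing import Dict, Set

# Dimensions marked with a cross in the support matrix above.
UNSUPPORTED: Dict[str, Set[str]] = {
    "fsdp": {"pp", "ep"},
    "megatron": set(),
    "archon": set(),
}


def check_supported(backend: str, sizes: Dict[str, int]) -> None:
    """Raise if `sizes` requests a dimension larger than 1 that `backend` lacks."""
    if backend not in UNSUPPORTED:
        raise ValueError(f"unknown backend {backend!r}")
    bad = sorted(d for d, n in sizes.items() if n > 1 and d in UNSUPPORTED[backend])
    if bad:
        raise ValueError(f"{backend} does not support: {', '.join(bad)}")


check_supported("fsdp", {"dp": 4, "tp": 2})           # passes silently
check_supported("megatron", {"dp": 2, "pp": 2, "tp": 4})
try:
    check_supported("fsdp", {"dp": 2, "pp": 2})
except ValueError as err:
    print(err)
```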

Key Backend-Specific Classes

FSDP: ParallelHelper (areal/engine/fsdp_utils/parallel.py) handles the logic for fully_shard and parallelize_model (areal/engine/fsdp_engine.py:86).
Megatron: MegatronBridgeAutoBridge (areal/engine/megatron_engine.py:20) converts between HuggingFace and Megatron-Core formats.
Archon: ForwardBackwardRunner (areal/experimental/engine/archon_runner.py) executes pipeline schedules such as ScheduleZBVZeroBubble (areal/experimental/engine/archon_engine.py:20) or ScheduleDualPipeV (areal/experimental/engine/archon_engine.py:19).

Sources: areal/engine/fsdp_engine.py:218-240, areal/engine/megatron_engine.py:17-30, areal/experimental/engine/archon_engine.py:16-22

Data Flow in Parallel Training

The following diagram illustrates how a micro-batch flows through a parallelized engine (e.g., Archon or Megatron).

Diagram: Parallel Data Flow (PP + TP + DP)
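The first two steps of this flow, splitting a global batch across DP ranks and then into micro-batches for the pipeline, can be sketched in a few lines. This is an assumption-laden illustration: it splits by sample count, whereas a real implementation may balance by sequence length or token count, and `split_evenly` is a hypothetical helper.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_evenly(items: Sequence[T], parts: int) -> List[List[T]]:
    """Split `items` into `parts` contiguous chunks whose sizes differ by at most one."""
    if parts < 1:
        raise ValueError("parts must be >= 1")
    base, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        # The first `extra` chunks each take one additional item.
        end = start + base + (1 if i < extra else 0)
        chunks.append(list(items[start:end]))
        start = end
    return chunks


batch = list(range(8))                # 8 sample ids
per_dp_rank = split_evenly(batch, 2)  # dp=2: one shard per data-parallel rank
micro_batches = [split_evenly(shard, 2) for shard in per_dp_rank]  # 2 micro-batches each
print(per_dp_rank)
print(micro_batches)
```

Each micro-batch then passes through the pipeline stages in order, with TP ranks inside a stage cooperating on every layer and DP ranks synchronizing gradients at the end of the step.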

Sources: areal/utils/data.py:122-125, areal/engine/megatron_engine.py:23-29, areal/experimental/engine/archon_engine.py:181-187

URL: https://deepwiki.com/inclusionAI/AReaL/8.1-parallelism-overview