
URL: https://deepwiki.com/inclusionAI/AReaL/8.1-parallelism-overview

Parallelism Overview | inclusionAI/AReaL | DeepWiki


Last indexed: 7 May 2026 (2e12c1)

Parallelism Overview

Purpose and Scope

This page introduces the multi-dimensional parallelism strategies used in AReaL to scale training and inference across multiple GPUs. It covers the five core parallelism dimensions (Data, Tensor, Pipeline, Context, Expert), how they compose to form a distributed device mesh, and how to specify them via the configuration system.

AReaL supports three primary training backends (FSDP, Megatron, and Archon), each offering a different level of support for these dimensions.

Sources: areal/api/cli_args.py:39-136, areal/engine/fsdp_engine.py:218-221, areal/experimental/engine/archon_engine.py:147-194, areal/engine/megatron_engine.py:168-186


Why Parallelism

Training large language models requires distributing computation across multiple GPUs due to:

  1. Memory constraints: Model parameters, activations, and optimizer states can exceed the memory of a single GPU.
  2. Compute throughput: Single-GPU training is too slow for practical iteration on billion-parameter models.
  3. Batch size scaling: Larger effective batch sizes (via Data Parallelism) can improve training stability and convergence.

AReaL supports five orthogonal parallelism dimensions that can be combined to efficiently utilize clusters ranging from single nodes (8 GPUs) to hundreds of GPUs.

Sources: areal/api/cli_args.py:100-136, areal/engine/fsdp_engine.py:218-221, areal/utils/data.py:116-127


Parallelism Dimensions

AReaL implements five parallelism dimensions that can be combined:

Dimension | Abbrev | Description | Memory Impact | Communication Pattern
Data Parallel (DP) | d | Replicate model across devices, split batch. | Low (duplicates params/grads). | All-reduce gradients.
Tensor Parallel (TP) | t | Split individual layers/tensors horizontally. | Reduces param/activation memory. | All-reduce/All-gather per layer.
Pipeline Parallel (PP) | p | Split model layers vertically into stages. | Reduces param memory per stage. | P2P (send/recv) between stages.
Context Parallel (CP) | c | Split sequence length (Ulysses). | Reduces activation memory. | All-to-all in attention.
Expert Parallel (EP) | e | Split MoE experts across devices. | Reduces expert param memory. | All-to-all for token routing.

Implementation Details

Sources: areal/api/cli_args.py:100-136, areal/experimental/engine/archon_engine.py:147-194, areal/engine/megatron_engine.py:168-186, areal/engine/fsdp_engine.py:89-94


World Size Calculation

The total number of GPUs required (world size) is computed as:

world_size = dp × tp × pp × cp

Note on Expert Parallel (ep): ep does not appear in the formula above. In AReaL's ArchonEngine, Expert Parallelism uses Dimension Borrowing, so the ep size must be compatible with the mesh formed by dp_shard, cp, and tp.

Configuration Examples

Parallelism Config | dp | tp | pp | cp | World Size | Notes
d8 | 8 | 1 | 1 | 1 | 8 | Standard Data Parallel.
d2t4 | 2 | 4 | 1 | 1 | 8 | DP + TP (common for 70B models).
d2p2t4 | 2 | 4 | 2 | 1 | 16 | DP + PP + TP.
d4t2c2 | 4 | 2 | 1 | 2 | 16 | DP + TP + CP (long context).
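
The world-size formula can be checked against the example configurations with a few lines of Python. This is a minimal sketch: the `world_size` helper is hypothetical and is not part of AReaL's API. It multiplies the four dimensions from the formula and leaves ep out.

```python
from math import prod


def world_size(dp: int = 1, tp: int = 1, pp: int = 1, cp: int = 1) -> int:
    """Hypothetical helper: GPUs required = dp * tp * pp * cp (ep is not a factor)."""
    sizes = {"dp": dp, "tp": tp, "pp": pp, "cp": cp}
    for name, size in sizes.items():
        if size < 1:
            raise ValueError(f"{name} must be >= 1, got {size}")
    return prod(sizes.values())


# The four example configurations from this section.
examples = {
    "d8": world_size(dp=8),
    "d2t4": world_size(dp=2, tp=4),
    "d2p2t4": world_size(dp=2, pp=2, tp=4),
    "d4t2c2": world_size(dp=4, tp=2, cp=2),
}
print(examples)  # {'d8': 8, 'd2t4': 8, 'd2p2t4': 16, 'd4t2c2': 16}
```

Any dimension that is omitted defaults to 1, which matches how the configuration strings drop unused dimensions.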

Sources: areal/api/cli_args.py:100-136, areal/engine/megatron_engine.py:180-209


Device Mesh Structure

AReaL organizes GPUs into a multi-dimensional device mesh. The implementation varies by engine:

FSDPEngine

Uses a 3D mesh: [dp, sp, tp] where sp is sequence (context) parallelism.

MegatronEngine

Uses Megatron-LM's internal mpu (model parallel utilities).

ArchonEngine

Uses a 4D DeviceMesh: [dp, pp, tp, cp].


Diagram: Logical GPU grouping for d2p2t2
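
To make the grouping for d2p2t2 concrete, the sketch below enumerates which global ranks share a communication group along each dimension. It assumes a row-major [dp, pp, tp] layout with tp varying fastest. That ordering is an assumption chosen for illustration, and the actual rank placement inside each engine may differ. The helpers `mesh_coords` and `groups` are hypothetical names, not AReaL functions.

```python
from itertools import product
from typing import Dict, List


def mesh_coords(shape: Dict[str, int]) -> Dict[int, Dict[str, int]]:
    """Map each global rank to its coordinate on every mesh dimension (row-major)."""
    names = list(shape)
    ranges = [range(shape[name]) for name in names]
    return {rank: dict(zip(names, idx)) for rank, idx in enumerate(product(*ranges))}


def groups(shape: Dict[str, int], dim: str) -> List[List[int]]:
    """Ranks that communicate along `dim`: all other coordinates are equal."""
    buckets: Dict[tuple, List[int]] = {}
    for rank, coord in mesh_coords(shape).items():
        key = tuple(v for k, v in coord.items() if k != dim)
        buckets.setdefault(key, []).append(rank)
    return list(buckets.values())


shape = {"dp": 2, "pp": 2, "tp": 2}  # d2p2t2 on 8 GPUs
tp_groups = groups(shape, "tp")  # [[0, 1], [2, 3], [4, 5], [6, 7]]
pp_groups = groups(shape, "pp")  # [[0, 2], [1, 3], [4, 6], [5, 7]]
dp_groups = groups(shape, "dp")  # [[0, 4], [1, 5], [2, 6], [3, 7]]
print(tp_groups, pp_groups, dp_groups)
```

Under this assumed layout, tensor-parallel peers are adjacent ranks, while data-parallel replicas are four ranks apart.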

Sources: areal/engine/fsdp_engine.py:209-215, areal/engine/megatron_engine.py:180-209, areal/experimental/engine/archon_engine.py:175-187


Specifying Parallelism: allocation_mode

The allocation_mode string is the primary way to configure distribution. It defines the backend and the size of each dimension (areal/api/cli_args.py:102).

Syntax Overview

[backend]:[dimensions]

Multiple pools (e.g., separate inference and training) are separated by +.

Examples

  • "d8": Auto-selects FSDP with 8-way Data Parallel.
  • "fsdp:d4t2": Explicit FSDP with 4-way DP and 2-way TP.
  • "megatron:d2p2t4": Megatron with 2-way DP, 2-way PP, and 4-way TP.
  • "archon:d2p2t2c2": Archon with 4D parallelism (2-way DP, PP, TP, and CP).

The string is parsed into a ParallelStrategy object, which is then passed to the engine initialization (areal/api/cli_args.py:48-51).
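
The syntax can be illustrated with a small stand-alone parser. This is a sketch under stated assumptions, not AReaL's parser: `ParsedPool`, `parse_pool`, and `parse_allocation_mode` are hypothetical names, the real ParallelStrategy class may expose different fields, and the real grammar may accept forms this sketch rejects. It handles only what this section describes: an optional backend prefix, single-letter dimensions (d, t, p, c, e) followed by a size, and pools separated by +.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional

DIM_KEYS = "dtpce"  # d=data, t=tensor, p=pipeline, c=context, e=expert


@dataclass
class ParsedPool:
    """Hypothetical result type; not AReaL's ParallelStrategy."""
    backend: Optional[str]  # None when no backend prefix is given
    dims: Dict[str, int]    # every dimension defaults to 1


def parse_pool(spec: str) -> ParsedPool:
    """Parse one pool such as 'd8' or 'megatron:d2p2t4'."""
    backend, _, dims_str = spec.strip().rpartition(":")
    if not re.fullmatch(r"(?:[dtpce]\d+)+", dims_str):
        raise ValueError(f"invalid dimension string: {dims_str!r}")
    dims = {key: 1 for key in DIM_KEYS}
    seen = set()
    for key, size in re.findall(r"([dtpce])(\d+)", dims_str):
        if key in seen or int(size) < 1:
            raise ValueError(f"bad or repeated dimension {key}{size}")
        seen.add(key)
        dims[key] = int(size)
    return ParsedPool(backend or None, dims)


def parse_allocation_mode(mode: str) -> List[ParsedPool]:
    """Split on '+' into pools and parse each one."""
    return [parse_pool(part) for part in mode.split("+")]


pool = parse_allocation_mode("megatron:d2p2t4")[0]
print(pool.backend, pool.dims)
```

A spec without a prefix, such as "d8", yields `backend=None` here. Choosing a default backend (FSDP in the first example above) is left to the caller in this sketch.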

Sources: areal/api/cli_args.py:48-51, areal/engine/megatron_engine.py:180-185


Configuration to Code Entity Mapping

The following diagram bridges the natural language configuration to the specific code classes and functions that implement the parallelism.


Diagram: Mapping configuration to engine implementation

Sources: areal/api/cli_args.py:48-55, areal/engine/fsdp_engine.py:218-221, areal/engine/megatron_engine.py:168-175, areal/experimental/engine/archon_engine.py:150-187, areal/experimental/engine/archon_runner.py:56


Backend Support Matrix

Feature | FSDPEngine | MegatronEngine | ArchonEngine
Data Parallel | Fully Sharded (FSDP2) | DistributedDataParallel (DDP) | Fully Sharded (FSDP2)
Tensor Parallel | DTensor-based | Megatron-Core TP | DTensor-based
Pipeline Parallel | - | Megatron-Core PP | PipelineStage (native)
Context Parallel | Ulysses | Packed CP | Ulysses
Expert Parallel | - | Megatron-Core EP | ExpertParallel (native)
Offloading | CPUOffloadPolicy | torch_memory_saver | torch_memory_saver

Key Backend-Specific Classes

Sources: areal/engine/fsdp_engine.py:218-240, areal/engine/megatron_engine.py:17-30, areal/experimental/engine/archon_engine.py:16-22


Data Flow in Parallel Training

The following diagram illustrates how a micro-batch flows through a parallelized engine (e.g., Archon or Megatron).


Diagram: Parallel Data Flow (PP + TP + DP)
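
The first two steps of that flow, sharding a batch across data-parallel ranks and cutting each shard into micro-batches for the pipeline, can be sketched in plain Python. This is an illustrative assumption rather than AReaL's logic: `split_evenly` and `plan` are hypothetical helpers, the micro-batch count is an arbitrary parameter, and the actual splitting in areal/utils/data.py may use different rules.

```python
from typing import List, Sequence


def split_evenly(items: Sequence[int], n: int) -> List[List[int]]:
    """Split into n contiguous chunks whose sizes differ by at most one."""
    q, r = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        size = q + (1 if i < r else 0)
        chunks.append(list(items[start:start + size]))
        start += size
    return chunks


def plan(batch: Sequence[int], dp: int, n_microbatches: int) -> List[List[List[int]]]:
    """For each DP rank, the list of micro-batches it feeds through its pipeline."""
    return [split_evenly(shard, n_microbatches) for shard in split_evenly(batch, dp)]


# 8 samples, 2 DP replicas, 2 micro-batches per replica.
schedule = plan(list(range(8)), dp=2, n_microbatches=2)
print(schedule)  # [[[0, 1], [2, 3]], [[4, 5], [6, 7]]]
```

Within each replica, every micro-batch would then pass through the pipeline stages in order, with tensor-parallel ranks inside a stage cooperating on each layer.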

Sources: areal/utils/data.py:122-125, areal/engine/megatron_engine.py:23-29, areal/experimental/engine/archon_engine.py:181-187