Last indexed: 7 May 2026 (2e12c1)

Device Mesh and Process Groups

This page explains AReaL's approach to distributed training infrastructure through device meshes and explicit process groups. DeviceMesh provides a multi-dimensional abstraction for organizing GPU ranks across data parallel (DP), tensor parallel (TP), context parallel (CP), expert parallel (EP), and pipeline parallel (PP) dimensions. Process groups define communication domains for collective operations like all-reduce and broadcast.

For information about specific parallelism strategies (FSDP, Megatron, Archon), see pages 3.2, 3.3, and 3.4. For parallelism constraint validation, see 8.8.

Core Concepts

DeviceMesh Abstraction

DeviceMesh is PyTorch's abstraction for organizing distributed ranks into a multi-dimensional grid. In AReaL, device meshes map logical parallelism dimensions to physical GPU ranks, enabling efficient extraction of specific process groups for different communication patterns.

Key characteristics:

Multi-dimensional indexing: Access ranks by dimension name (e.g., mesh["dp"], mesh["tp"]).
Automatic process group creation: PyTorch creates process groups for each mesh dimension.
Submesh extraction: Extract lower-dimensional meshes for specific operations.
Never global: AReaL explicitly avoids using PyTorch's default global process group for core operations.

Dimension ordering matters. Different backends use different ordering conventions:

Backend	Dimension Order	Mesh Shape
FSDP	`dp × sp × tp`	3D mesh
Megatron	`tp-cp-ep-dp-pp`	Uses Megatron's internal ordering
Archon	`dp_shard × tp × cp × ep × pp`	5D mesh
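
The effect of ordering can be seen with plain index arithmetic. The sketch below assumes the row-major layout that PyTorch's init_device_mesh produces by default (ranks 0..N-1 reshaped to the mesh shape, so the last dimension varies fastest). The helper name groups_along is hypothetical and is not part of AReaL or PyTorch.

```python
from itertools import product
from math import prod


def groups_along(shape, names, dim):
    """Return the rank groups that vary only along `dim` in a row-major mesh."""
    axis = names.index(dim)
    # Row-major strides: the last dimension has stride 1.
    strides = [prod(shape[i + 1:]) for i in range(len(shape))]
    # Fix every other coordinate, then walk along `axis`.
    others = [range(n) if i != axis else [0] for i, n in enumerate(shape)]
    groups = []
    for coord in product(*others):
        base = sum(c * s for c, s in zip(coord, strides))
        groups.append([base + k * strides[axis] for k in range(shape[axis])])
    return groups


# Illustrative 8-rank FSDP-style mesh: dp=2, sp=2, tp=2.
fsdp_shape, fsdp_names = (2, 2, 2), ("dp", "sp", "tp")
tp_groups = groups_along(fsdp_shape, fsdp_names, "tp")
dp_groups = groups_along(fsdp_shape, fsdp_names, "dp")
```

With this layout the tp groups hold adjacent ranks ([0, 1], [2, 3], ...) while the dp groups are strided ([0, 4], [1, 5], ...). Reordering the dimension names changes which dimension receives the adjacent ranks, which is why the ordering convention of each backend matters.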


Code Entity Mapping: DeviceMesh Structure

The following diagram maps high-level parallelism concepts to the specific code entities that manage them within AReaL.

[Diagram: "Parallelism Code Entity Mapping"]


Process Groups

Process groups define communication domains for distributed operations. Each group contains a subset of ranks that can communicate with each other. AReaL creates explicit process groups for each parallelism dimension rather than relying on PyTorch's default global group.

Process group types in AReaL:

Group Type	Purpose	Created By
`dp_group`	Data parallel all-reduce for gradients	All engines
`sp_group`	Sequence parallel (Ulysses) operations	FSDP
`tp_group`	Tensor parallel collective operations	All engines
`cp_group`	Context parallel (Ulysses SP)	Archon/Megatron
`ep_group`	Expert parallel for MoE models	Archon/Megatron
`pp_group`	Pipeline parallel stages	Archon/Megatron
`_cpu_group`	CPU-backed barrier synchronization	All engines

Explicit Group Management Principle: AReaL never uses the global process group, that is, it never calls a collective such as dist.all_reduce() without a group= argument. All collective operations explicitly specify the process group to avoid synchronization deadlocks across different model components.
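
The scope of a collective is set by the group it runs over. The toy simulation below uses plain dictionaries rather than torch.distributed, and the function name all_reduce_sum is hypothetical. It assumes 4 ranks laid out as dp=2 × tp=2 in row-major order, so each dp group pairs the ranks that hold the same tensor-parallel shard.

```python
def all_reduce_sum(values, group):
    """Simulate a sum all-reduce: every rank in `group` ends with the group total."""
    total = sum(values[r] for r in group)
    return {**values, **{r: total for r in group}}


# Per-rank gradient values; ranks 0 and 2 hold one TP shard, ranks 1 and 3 the other.
grads = {0: 1.0, 1: 10.0, 2: 3.0, 3: 30.0}
dp_groups = [[0, 2], [1, 3]]

# Explicit groups: each TP shard is reduced only across its data-parallel peers.
scoped = grads
for group in dp_groups:
    scoped = all_reduce_sum(scoped, group)

# Global group: values from different TP shards are mixed together.
unscoped = all_reduce_sum(grads, [0, 1, 2, 3])
```

Here scoped keeps the two shards separate (4.0 on ranks 0 and 2, 40.0 on ranks 1 and 3), whereas unscoped gives every rank 44.0. Passing the group explicitly keeps each reduction inside the intended communication domain.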


Archon Engine Device Mesh

Archon uses a torch-native implementation to manage its 5D mesh. The mesh structure includes dimensions for expert parallelism (EP) and context parallelism (CP).

Archon Mesh Semantics

world_mesh: The global 5D device mesh encompassing all parallelism dimensions.
pp_stages: Pipeline stages managed through explicit PipelineStage objects.
Parallel Dimensions: Dimensions include dp_shard, tp, cp, ep, and pp.

Creation Flow: Archon Process Groups

[Diagram: "Archon Mesh Initialization"]


FSDP Device Mesh Construction

FSDP uses ParallelHelper to create a 3D device mesh with dimensions [dp × sp × tp]. The ParallelHelper.parallelize_model() method uses this mesh structure.
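
Because the product of the mesh dimensions must equal the world size, the dp size follows from the world size once sp and tp are chosen. The sketch below illustrates that arithmetic only. The function name fsdp_mesh_shape is hypothetical and is not a ParallelHelper method.

```python
def fsdp_mesh_shape(world_size: int, sp: int = 1, tp: int = 1) -> tuple[int, int, int]:
    """Derive a (dp, sp, tp) shape whose product equals `world_size`."""
    if world_size < 1 or sp < 1 or tp < 1:
        raise ValueError("world_size, sp, and tp must all be positive")
    model_parallel = sp * tp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by sp*tp={model_parallel}"
        )
    return (world_size // model_parallel, sp, tp)


# 8 GPUs with sp=2 and tp=2 leave room for 2 data-parallel replicas.
shape = fsdp_mesh_shape(8, sp=2, tp=2)
```

A shape computed this way could then be passed to a mesh constructor together with the dimension names ("dp", "sp", "tp").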

Code Implementation

The FSDP engine creates its device mesh and process groups in create_process_group(). It uses a default timeout of DIST_GROUP_DEFAULT_TIMEOUT (7200s) to handle long generation phases.


CPU Groups for Synchronization

All engines create a CPU-backed process group using the Gloo backend. This is essential for synchronization during model offloading when GPU communication may be unavailable or the GPU memory is being cleared.

Purpose and Usage

Why CPU groups are needed:

Offload mode barriers: When model parameters are offloaded to CPU, GPU-backed collectives cannot be used.
Cross-backend synchronization: Gloo works on CPU even when main communication uses NCCL/XCCL.
Reliable barriers: CPU barriers are more stable during memory-intensive operations.

The creation pattern is identical across all engines.


Weight Update Process Groups

When connecting training engines to inference engines for asynchronous RL, additional process groups are created for weight synchronization, specifically for the nccl or xccl update mode.

Weight Update Coordination

[Diagram: "Weight Update Synchronization Topology"]

Key characteristics:

Custom Initialization: init_custom_process_group is used to create a separate communication domain between the trainer and remote inference backends. This function bypasses the standard dist.init_process_group to allow multiple main groups in the same process.
Warmup Mechanism: warmup_process_groups forces eager initialization of the collective communicator to avoid race conditions during training. It performs a dummy all_reduce to ensure all ranks are aligned.
Timeout Management: patch_dist_group_timeout ensures long-running updates don't time out by patching distributed_c10d.default_pg_timeout.


Best Practices

Always Use Explicit Groups: Never rely on the global process group. Use self.dp_group or self._cpu_group explicitly.
Large Timeouts for Generation: Generation phases can be very long. AReaL uses a 7200s timeout by default.
DP Head Responsibilities: Use dp_head to ensure only one rank per replica performs I/O or logging tasks.
Group Warmup: Always call warmup_process_groups after creating custom groups to prevent HCCL/NCCL initialization errors on NPU/GPU.
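
The DP head rule can be sketched with rank arithmetic. This is an illustration under two assumptions that the page does not confirm: the mesh is a row-major [dp × sp × tp] layout, and the head of each replica is its first rank. The function names are hypothetical, and AReaL's actual dp_head computation may differ.

```python
def is_dp_head(rank: int, sp_size: int, tp_size: int) -> bool:
    """True for the first rank of each data-parallel replica (assumed layout)."""
    # Each replica spans sp_size * tp_size consecutive ranks in a row-major mesh.
    return rank % (sp_size * tp_size) == 0


def save_checkpoint_if_head(rank, sp_size, tp_size, written):
    """Record a pretend checkpoint write only on replica heads."""
    if is_dp_head(rank, sp_size, tp_size):
        written.append(rank)


# Illustrative 8-rank job: dp=2, sp=2, tp=2.
written = []
for rank in range(8):
    save_checkpoint_if_head(rank, sp_size=2, tp_size=2, written=written)
```

In this example only ranks 0 and 4 perform the write, one per replica, so the other ranks in each replica do not duplicate the I/O.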



URL: https://deepwiki.com/inclusionAI/AReaL/8.5-device-mesh-and-process-groups