Last indexed: 7 May 2026 (2e12c1)

Multi-Node Training

Purpose and Scope

This page provides technical details and practical tutorials for setting up and running AReaL training experiments across multiple nodes. AReaL is designed as an asynchronous RL system that scales by distributing training and inference workers across a cluster, coordinated by a centralized scheduler. The system currently supports multi-node execution primarily through the RayScheduler areal/infra/scheduler/ray.py58-65 SlurmScheduler areal/infra/scheduler/slurm.py72-89 and LocalScheduler (for single-node multi-process testing) areal/infra/scheduler/local.py92-98

For details on the abstract Scheduler API and its implementations, see 9.1 Scheduler API For configuration syntax and parameters, see 2.1 Configuration Overview For information on parallelism strategies within engines, see 8.1 Parallelism Overview

Prerequisites

Hardware Requirements

Multi-node distributed training requires a robust infrastructure to handle high-bandwidth communication and large-scale model states.

Component	Requirement	Notes
GPUs per node	1-8+ GPUs	Multi-node examples typically utilize 8 GPUs per node.
CPU	8-32+ cores per node	Higher core counts (32+) are recommended for multi-node efficiency.
Memory	32GB-256GB+ per node	256GB+ is recommended for large-scale experiments.
Network	High-bandwidth interconnect	NVSwitch + RoCE or Infiniband is recommended for NCCL performance.
Storage	Shared storage	Required for multi-node weight synchronization, logging, and checkpoints.

Software Requirements

Component	Version	Purpose
Docker Image	`v1.0.3-sglang` or `v1.0.3-vllm`	Pre-configured environment with all CUDA extras and AReaL runtime.
Ray	2.x	Distributed task orchestration and remote worker management areal/infra/scheduler/ray.py9-11
SkyPilot	Latest	Cloud-agnostic deployment and cluster provisioning.
Torch	2.x	Base deep learning framework.

Shared Storage Requirement

Critical: All nodes in a multi-node setup must have access to a shared filesystem path. This path is used for:

Checkpoint persistence and recovery.
Training logs and performance traces.
Dataset storage for distributed data loading.
Weight synchronization between training and inference engines.

The LocalScheduler and SlurmScheduler validate these paths using validate_shared_path areal/infra/scheduler/local.py138-144 areal/infra/scheduler/slurm.py120-126

Sources: areal/infra/scheduler/local.py138-144 areal/infra/scheduler/slurm.py120-126 areal/infra/scheduler/ray.py58-84

Multi-Node Architecture Overview

The following diagram bridges the high-level multi-node concepts with the specific code entities involved in orchestration.

Multi-Node Distributed Training Architecture

This diagram illustrates how AReaL distributes computation across nodes. The RayScheduler (requested via scheduler.type=ray) or SlurmScheduler (via scheduler.type=slurm) provisions workers across the cluster. Training engines like FSDPEngine form NCCL collectives for gradient synchronization, while all workers interact with a common fileroot mount for persistent data.

Sources: areal/api/scheduler_api.py43-49 areal/infra/scheduler/ray.py58-65 areal/infra/scheduler/slurm.py72-89 areal/infra/rpc/rpc_server.py3-13

Multi-Node Training with Ray

Ray is a primary scheduler for multi-node training, providing automatic resource management and remote object invocation through RayRPCServer areal/infra/rpc/ray_rpc_server.py

Ray Scheduler Implementation

The RayScheduler manages RayWorkerInfo which contains the worker's id, the Ray ActorHandle, and the associated PlacementGroup areal/infra/scheduler/ray.py48-55

Key Functions:

create_workers(): Creates Ray actors based on SchedulingSpec areal/infra/scheduler/ray.py167-175
create_engine(): Instantiates engine classes (e.g., FSDPEngine) on remote Ray actors via RPC areal/infra/scheduler/ray.py228-235
_get_placement_strategy(): Determines how to allocate GPUs using strategies like SharedRayPlacementStrategy or SeparatedRayPlacementStrategy areal/infra/scheduler/ray.py145-165

Configuring AReaL for Ray

Specify scheduler.type=ray in your configuration. The cluster.n_nodes and cluster.n_gpus_per_node parameters define the total resource pool.

Sources: areal/infra/scheduler/ray.py58-84 areal/api/cli_args.py8-10

Multi-Node Training with Slurm

The SlurmScheduler allows AReaL to integrate with HPC clusters. It uses srun to launch SyncRPCServer processes across allocated nodes areal/infra/scheduler/slurm.py72-89

Slurm Worker Management

The scheduler tracks SlurmWorkerInfo, which includes the slurm_job_id and the specific node where the worker is running areal/infra/scheduler/slurm.py59-69

Key Functions:

create_workers(): Generates sbatch scripts and submits jobs to the Slurm queue areal/infra/scheduler/slurm.py235-245
get_workers(): Discovers worker IP addresses and ports by reading Slurm logs or querying job metadata areal/infra/scheduler/slurm.py410-420
delete_workers(): Cancels Slurm jobs using scancel areal/infra/scheduler/slurm.py560-570

Sources: areal/infra/scheduler/slurm.py59-89 areal/infra/utils/slurm.py46-50

Worker Startup and RPC Flow

The following sequence diagram maps the startup process to the internal code flow of the Scheduler and RPCServer.

Worker and Engine Initialization Flow

The flow begins with the Trainer requesting workers. The scheduler launches SyncRPCServer (for Slurm/Local) or RayRPCServer (for Ray) areal/infra/rpc/rpc_server.py33-62 Once the RPC server is running, the scheduler calls create_engine to instantiate the actual training or inference logic on the remote device areal/api/scheduler_api.py182-189

Sources: areal/infra/rpc/rpc_server.py33-62 areal/api/scheduler_api.py182-189 areal/infra/scheduler/local.py336-345

Multi-Node Configuration Examples

GRPO Multi-Node with Megatron Backend

For large models requiring tensor parallelism across nodes, the megatron backend is often used with a multi-node scheduler examples/math/gsm8k_grpo_megatron.yaml43-52

Camel Agentic RL Multi-Node

Agentic workflows using CamelRLVRWorkflow can scale rollouts across multiple nodes by configuring the rollout and actor backends to use the cluster resources examples/camel/train.py68-74 examples/camel/config.yaml11-14

Sources: examples/math/gsm8k_grpo_megatron.yaml9-12 examples/camel/config.yaml11-14 examples/camel/train.py106-120

Best Practices

Shared Filesystem Validation: Ensure fileroot is accessible from all nodes. The scheduler will check this during initialization areal/infra/scheduler/local.py138-140
Name Resolution: Use type: nfs or type: etcd3 for name_resolve in multi-node settings to allow workers to discover each other's network locations areal/infra/scheduler/local.py130-134
Port Allocation: In multi-node environments, the scheduler handles finding free ports on remote nodes to avoid conflicts areal/utils/network.py50-54
Graceful Teardown: Use reverse_order=True in delete_workers() to ensure rank-0 (often the TCPStore host) is the last to terminate, preventing NCCL heartbeat warnings areal/api/scheduler_api.py122-131
Environment Variables: Pass critical cluster variables (like HCCL_CONNECT_TIMEOUT for NPU clusters) via the env_vars field in SchedulingSpec examples/math/boba_grpo.yaml88-98

Sources: areal/infra/scheduler/local.py130-144 areal/api/scheduler_api.py122-131 examples/math/boba_grpo.yaml88-98 areal/utils/network.py50-54

Refresh this wiki

URL: https://deepwiki.com/inclusionAI/AReaL/14.5-multi-node-training

⇱ Multi-Node Training | inclusionAI/AReaL | DeepWiki