VOOZH about

URL: https://deepwiki.com/inclusionAI/AReaL/14.5-multi-node-training

⇱ Multi-Node Training | inclusionAI/AReaL | DeepWiki


Loading...
Last indexed: 7 May 2026 (2e12c1)
Menu

Multi-Node Training

Purpose and Scope

This page provides technical details and practical tutorials for setting up and running AReaL training experiments across multiple nodes. AReaL is designed as an asynchronous RL system that scales by distributing training and inference workers across a cluster, coordinated by a centralized scheduler. The system currently supports multi-node execution primarily through the RayScheduler areal/infra/scheduler/ray.py58-65 SlurmScheduler areal/infra/scheduler/slurm.py72-89 and LocalScheduler (for single-node multi-process testing) areal/infra/scheduler/local.py92-98

For details on the abstract Scheduler API and its implementations, see 9.1 Scheduler API For configuration syntax and parameters, see 2.1 Configuration Overview For information on parallelism strategies within engines, see 8.1 Parallelism Overview


Prerequisites

Hardware Requirements

Multi-node distributed training requires a robust infrastructure to handle high-bandwidth communication and large-scale model states.

ComponentRequirementNotes
GPUs per node1-8+ GPUsMulti-node examples typically utilize 8 GPUs per node.
CPU8-32+ cores per nodeHigher core counts (32+) are recommended for multi-node efficiency.
Memory32GB-256GB+ per node256GB+ is recommended for large-scale experiments.
NetworkHigh-bandwidth interconnectNVSwitch + RoCE or Infiniband is recommended for NCCL performance.
StorageShared storageRequired for multi-node weight synchronization, logging, and checkpoints.

Software Requirements

ComponentVersionPurpose
Docker Imagev1.0.3-sglang or v1.0.3-vllmPre-configured environment with all CUDA extras and AReaL runtime.
Ray2.xDistributed task orchestration and remote worker management areal/infra/scheduler/ray.py9-11
SkyPilotLatestCloud-agnostic deployment and cluster provisioning.
Torch2.xBase deep learning framework.

Shared Storage Requirement

Critical: All nodes in a multi-node setup must have access to a shared filesystem path. This path is used for:

  • Checkpoint persistence and recovery.
  • Training logs and performance traces.
  • Dataset storage for distributed data loading.
  • Weight synchronization between training and inference engines.

The LocalScheduler and SlurmScheduler validate these paths using validate_shared_path areal/infra/scheduler/local.py138-144 areal/infra/scheduler/slurm.py120-126

Sources: areal/infra/scheduler/local.py138-144 areal/infra/scheduler/slurm.py120-126 areal/infra/scheduler/ray.py58-84


Multi-Node Architecture Overview

The following diagram bridges the high-level multi-node concepts with the specific code entities involved in orchestration.


Multi-Node Distributed Training Architecture

This diagram illustrates how AReaL distributes computation across nodes. The RayScheduler (requested via scheduler.type=ray) or SlurmScheduler (via scheduler.type=slurm) provisions workers across the cluster. Training engines like FSDPEngine form NCCL collectives for gradient synchronization, while all workers interact with a common fileroot mount for persistent data.

Sources: areal/api/scheduler_api.py43-49 areal/infra/scheduler/ray.py58-65 areal/infra/scheduler/slurm.py72-89 areal/infra/rpc/rpc_server.py3-13


Multi-Node Training with Ray

Ray is a primary scheduler for multi-node training, providing automatic resource management and remote object invocation through RayRPCServer areal/infra/rpc/ray_rpc_server.py

Ray Scheduler Implementation

The RayScheduler manages RayWorkerInfo which contains the worker's id, the Ray ActorHandle, and the associated PlacementGroup areal/infra/scheduler/ray.py48-55

Key Functions:

Configuring AReaL for Ray

Specify scheduler.type=ray in your configuration. The cluster.n_nodes and cluster.n_gpus_per_node parameters define the total resource pool.


Sources: areal/infra/scheduler/ray.py58-84 areal/api/cli_args.py8-10


Multi-Node Training with Slurm

The SlurmScheduler allows AReaL to integrate with HPC clusters. It uses srun to launch SyncRPCServer processes across allocated nodes areal/infra/scheduler/slurm.py72-89

Slurm Worker Management

The scheduler tracks SlurmWorkerInfo, which includes the slurm_job_id and the specific node where the worker is running areal/infra/scheduler/slurm.py59-69

Key Functions:

Sources: areal/infra/scheduler/slurm.py59-89 areal/infra/utils/slurm.py46-50


Worker Startup and RPC Flow

The following sequence diagram maps the startup process to the internal code flow of the Scheduler and RPCServer.


Worker and Engine Initialization Flow

The flow begins with the Trainer requesting workers. The scheduler launches SyncRPCServer (for Slurm/Local) or RayRPCServer (for Ray) areal/infra/rpc/rpc_server.py33-62 Once the RPC server is running, the scheduler calls create_engine to instantiate the actual training or inference logic on the remote device areal/api/scheduler_api.py182-189

Sources: areal/infra/rpc/rpc_server.py33-62 areal/api/scheduler_api.py182-189 areal/infra/scheduler/local.py336-345


Multi-Node Configuration Examples

GRPO Multi-Node with Megatron Backend

For large models requiring tensor parallelism across nodes, the megatron backend is often used with a multi-node scheduler examples/math/gsm8k_grpo_megatron.yaml43-52


Camel Agentic RL Multi-Node

Agentic workflows using CamelRLVRWorkflow can scale rollouts across multiple nodes by configuring the rollout and actor backends to use the cluster resources examples/camel/train.py68-74 examples/camel/config.yaml11-14


Sources: examples/math/gsm8k_grpo_megatron.yaml9-12 examples/camel/config.yaml11-14 examples/camel/train.py106-120


Best Practices

  1. Shared Filesystem Validation: Ensure fileroot is accessible from all nodes. The scheduler will check this during initialization areal/infra/scheduler/local.py138-140
  2. Name Resolution: Use type: nfs or type: etcd3 for name_resolve in multi-node settings to allow workers to discover each other's network locations areal/infra/scheduler/local.py130-134
  3. Port Allocation: In multi-node environments, the scheduler handles finding free ports on remote nodes to avoid conflicts areal/utils/network.py50-54
  4. Graceful Teardown: Use reverse_order=True in delete_workers() to ensure rank-0 (often the TCPStore host) is the last to terminate, preventing NCCL heartbeat warnings areal/api/scheduler_api.py122-131
  5. Environment Variables: Pass critical cluster variables (like HCCL_CONNECT_TIMEOUT for NPU clusters) via the env_vars field in SchedulingSpec examples/math/boba_grpo.yaml88-98

Sources: areal/infra/scheduler/local.py130-144 areal/api/scheduler_api.py122-131 examples/math/boba_grpo.yaml88-98 areal/utils/network.py50-54