
URL: https://deepwiki.com/inclusionAI/AReaL/9.6-slurm-launcher

Last indexed: 7 May 2026 (2e12c1)

SLURM Launcher

The SLURM Launcher provides a Scheduler implementation for executing AReaL workloads on High-Performance Computing (HPC) clusters managed by SLURM (Simple Linux Utility for Resource Management). It enables distributed training and inference across multiple compute nodes by translating AReaL's resource allocation specifications into SLURM job submissions and managing worker processes within allocated nodes.

Scope: This document covers the SlurmScheduler implementation, job submission mechanics, and multi-node coordination. It details how the system interacts with the abstract Scheduler interface defined in areal/api/scheduler_api.py:43-49.

Purpose and Architecture

The SLURM Launcher bridges AReaL's scheduler abstraction and SLURM's resource management system. It translates high-level resource requests (defined via SchedulingSpec, areal/api/cli_args.py:255-274) into SLURM sbatch scripts, spawns processes across the allocated nodes via srun, and sets up the environment that distributed training needs.

Key Responsibilities:

Sources: areal/api/scheduler_api.py:43-49, areal/infra/scheduler/slurm.py:72-90, areal/utils/fs.py:58-69

SLURM Scheduler Implementation

The SlurmScheduler (areal/infra/scheduler/slurm.py:72-159) manages the lifecycle of distributed jobs. It uses the Job dataclass (areal/api/scheduler_api.py:36-41) to encapsulate the requirements for a specific role (e.g., "actor" or "rollout").

Code Entity Mapping: Scheduler to SLURM

The following diagram maps the abstract interface to the functional components in the SLURM implementation.

[Diagram: Scheduler to SLURM Mapping]


Sources: areal/api/scheduler_api.py:11-47, areal/infra/scheduler/slurm.py:59-69, areal/infra/scheduler/slurm.py:284-332

Job Submission and Resource Allocation

Resource Request Translation

The scheduler processes a Job (areal/api/scheduler_api.py:36-41), which contains a list of SchedulingSpec tasks. Each task defines its GPU and CPU requirements. These are translated into #SBATCH directives within _generate_sbatch_script (areal/infra/scheduler/slurm.py:214-282).

| SLURM Directive | Purpose | Source Mapping |
|---|---|---|
| #SBATCH --nodes | Total physical nodes | job.replicas / n_gpus_per_node (areal/infra/scheduler/slurm.py:228) |
| #SBATCH --gres | GPU resources per node | spec.gpu (areal/infra/scheduler/slurm.py:233) |
| #SBATCH --cpus-per-task | CPU cores per worker | spec.cpu (areal/infra/scheduler/slurm.py:232) |
| #SBATCH --partition | Cluster queue | spec.strategy.partition (areal/infra/scheduler/slurm.py:237) |

Sources: areal/api/scheduler_api.py:36-41, areal/infra/scheduler/slurm.py:214-282
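
The directive mapping above can be illustrated with a minimal sketch. This is not the code in _generate_sbatch_script: SpecSketch and sbatch_header are hypothetical names, the field layout is an assumption, and rounding the node count up is an assumption about how a fractional job.replicas / n_gpus_per_node would be handled.

```python
import math
from dataclasses import dataclass


@dataclass
class SpecSketch:
    """Hypothetical stand-in for SchedulingSpec; field names are assumptions."""
    gpu: int        # GPUs requested per node (maps to --gres)
    cpu: int        # CPU cores per worker (maps to --cpus-per-task)
    partition: str  # SLURM partition, i.e. the cluster queue


def sbatch_header(job_name: str, replicas: int, spec: SpecSketch,
                  n_gpus_per_node: int) -> str:
    """Render an #SBATCH preamble following the mapping in the table above."""
    # Assumption: round up so a partially filled node is still requested.
    nodes = max(1, math.ceil(replicas / n_gpus_per_node))
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --cpus-per-task={spec.cpu}",
        f"#SBATCH --partition={spec.partition}",
    ]
    if spec.gpu > 0:
        # CPU-only roles skip the GPU request entirely (assumption).
        lines.append(f"#SBATCH --gres=gpu:{spec.gpu}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(sbatch_header("actor", 16, SpecSketch(gpu=8, cpu=4, partition="gpu"), 8))
```

With 16 replicas and 8 GPUs per node, the sketch requests two nodes; the real scheduler may add further directives (time limits, output paths, containers) that are not shown here.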

Node Discovery and Worker Lifecycle

Once the SLURM job starts, the scheduler uses _discover_workers (areal/infra/scheduler/slurm.py:448-509) to identify the allocated nodes. It queries the job with scontrol show job and expands the reported node list with _parse_slurm_nodelist (areal/infra/utils/slurm.py:46-50).

[Diagram: SLURM Node Discovery Flow]


Sources: areal/infra/scheduler/slurm.py:448-509, areal/infra/utils/slurm.py:46-50

Distributed Communication Setup

The SlurmScheduler configures environment variables to allow torch.distributed and AReaL's RPC system to initialize correctly across nodes.

Required Environment Variables

| Variable | Role | Logic |
|---|---|---|
| MASTER_ADDR | Rendezvous point | IP of the first node in the allocation (areal/infra/scheduler/slurm.py:356) |
| MASTER_PORT | Rendezvous port | Dynamically selected port (areal/infra/scheduler/slurm.py:357) |
| WORLD_SIZE | Total process count | Total number of replicas (areal/infra/scheduler/slurm.py:358) |
| RANK | Global process ID | SLURM_PROCID mapped to worker index (areal/infra/scheduler/slurm.py:359) |

The scheduler uses get_env_vars (areal/infra/utils/launcher.py:40-44) and get_thread_env_vars (areal/infra/utils/launcher.py:45-46) to build the execution environment for srun.

Sources: areal/infra/scheduler/slurm.py:348-369, areal/infra/utils/launcher.py:40-46

Shared Filesystem and Storage

Distributed training on SLURM clusters requires a shared filesystem (e.g., NFS, Lustre) for weight synchronization and logging.

  1. Validation: During initialization (areal/infra/scheduler/slurm.py:120-126), the scheduler calls validate_shared_path (areal/utils/fs.py:49) on the fileroot and nfs_record_root.
  2. Name Resolution: By default, it uses nfs-based name resolution (areal/infra/scheduler/slurm.py:86), which relies on file-based locking on the shared filesystem to coordinate worker IPs and ports.
  3. Logging: Logs are written under the shared fileroot (areal/infra/scheduler/slurm.py:168-176), allowing the master process to aggregate and stream logs from all workers via build_streaming_log_cmd (areal/infra/utils/proc.py:45-47).

Sources: areal/infra/scheduler/slurm.py:120-126, areal/utils/fs.py:49, areal/infra/utils/proc.py:45-47

RPC Server Integration

Each SLURM worker runs the areal.infra.rpc.rpc_server module (areal/infra/rpc/rpc_server.py:1-66):

  • Guard Architecture: Each worker runs a Flask-based RPC server with GuardState (areal/infra/rpc/rpc_server.py:50-53).
  • Blueprint Registration: The server registers data_bp for data transfer and engine_bp for model execution (areal/infra/rpc/rpc_server.py:54-55).
  • Engine Lifecycle: When the scheduler calls create_engine (areal/infra/scheduler/slurm.py:552-602), the remote RPC server instantiates the requested engine (e.g., FSDPEngine) and manages its lifecycle.
  • Serial Execution: Engine operations are executed serially in a dedicated engine thread to ensure compatibility with collective communication backends like NCCL.

Sources: areal/infra/rpc/rpc_server.py:1-66, areal/infra/scheduler/slurm.py:552-602, areal/api/scheduler_api.py:182-189
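
The serial-execution point can be illustrated with a small sketch that funnels every engine operation through one dedicated thread, regardless of which request-handler thread received the call. SerialEngineRunner is a hypothetical name and this is not the code in rpc_server.py; it uses only the Python standard library and leaves Flask out.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SerialEngineRunner:
    """Run all submitted operations one at a time on a single worker thread."""

    def __init__(self) -> None:
        # max_workers=1 guarantees operations never overlap and always run
        # on the same thread, which collective backends generally expect.
        self._pool = ThreadPoolExecutor(max_workers=1,
                                        thread_name_prefix="engine")

    def call(self, fn, *args, **kwargs):
        """Submit `fn` to the engine thread and block until it returns."""
        return self._pool.submit(fn, *args, **kwargs).result()

    def thread_name(self) -> str:
        """Report the name of the thread that executes engine operations."""
        return self.call(lambda: threading.current_thread().name)

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)
```

An HTTP handler would wrap each engine operation in runner.call(...), so concurrent requests queue up instead of issuing interleaved collective calls from different threads.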