Last indexed: 7 May 2026 (2e12c1)

SLURM Launcher

The SLURM Launcher provides a Scheduler implementation for executing AReaL workloads on High-Performance Computing (HPC) clusters managed by SLURM (Simple Linux Utility for Resource Management). It enables distributed training and inference across multiple compute nodes by translating AReaL's resource allocation specifications into SLURM job submissions and managing worker processes within allocated nodes.

Scope: This document covers the SlurmScheduler implementation, job submission mechanics, and multi-node coordination. It details how the system interacts with the abstract Scheduler interface defined in areal/api/scheduler_api.py43-49.

Purpose and Architecture

The SLURM Launcher bridges AReaL's scheduler abstraction with SLURM's resource management system. It translates high-level resource requests (defined via SchedulingSpec areal/api/cli_args.py255-274) into SLURM sbatch scripts, manages process spawning across allocated nodes via srun, and coordinates distributed training with proper environment configuration.

Key Responsibilities:

Submit SLURM jobs with appropriate resource requests (GPUs, CPUs, nodes) areal/infra/scheduler/slurm.py284-332
Parse SLURM environment variables to discover allocated nodes and resources areal/infra/scheduler/slurm.py448-472
Create and manage SlurmWorkerInfo objects across multiple nodes areal/infra/scheduler/slurm.py59-69
Configure distributed communication (NCCL, Gloo) with correct node rankings areal/infra/scheduler/slurm.py348-369
Validate shared filesystem access across all nodes using validate_shared_path areal/utils/fs.py58-69
Handle job lifecycle including health checks and cancellation areal/infra/scheduler/slurm.py511-545

Sources: areal/api/scheduler_api.py43-49 areal/infra/scheduler/slurm.py72-90 areal/utils/fs.py58-69

SLURM Scheduler Implementation

The SlurmScheduler areal/infra/scheduler/slurm.py72-159 manages the lifecycle of distributed jobs. It uses the Job dataclass areal/api/scheduler_api.py36-41 to encapsulate the requirements for a specific role (e.g., "actor" or "rollout").

Code Entity Mapping: Scheduler to SLURM

The following diagram maps the abstract interface to the functional components in the SLURM implementation.

[Diagram: Scheduler to SLURM Mapping]

Sources: areal/api/scheduler_api.py11-47 areal/infra/scheduler/slurm.py59-69 areal/infra/scheduler/slurm.py284-332

Job Submission and Resource Allocation

Resource Request Translation

The scheduler processes a Job areal/api/scheduler_api.py36-41, which contains a list of SchedulingSpec tasks. Each task defines its GPU and CPU requirements. These are translated into #SBATCH directives within _generate_sbatch_script areal/infra/scheduler/slurm.py214-282.

SLURM Directive	Purpose	Source Mapping
`#SBATCH --nodes`	Total physical nodes	`job.replicas` / `n_gpus_per_node` areal/infra/scheduler/slurm.py228
`#SBATCH --gres`	GPU resources per node	`spec.gpu` areal/infra/scheduler/slurm.py233
`#SBATCH --cpus-per-task`	CPU cores per worker	`spec.cpu` areal/infra/scheduler/slurm.py232
`#SBATCH --partition`	Cluster queue	`spec.strategy.partition` areal/infra/scheduler/slurm.py237

Sources: areal/api/scheduler_api.py36-41 areal/infra/scheduler/slurm.py214-282
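
The translation in the table above can be illustrated with a minimal sketch. It is not the body of _generate_sbatch_script: SpecSketch is a hypothetical stand-in for SchedulingSpec, and both the field names and the rule of rounding the node count up to whole nodes are assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class SpecSketch:
    """Hypothetical stand-in for SchedulingSpec; field names are assumptions."""
    gpu: int        # GPUs requested per worker
    cpu: int        # CPU cores requested per worker
    partition: str  # SLURM partition (queue) name


def sbatch_header(spec: SpecSketch, replicas: int, n_gpus_per_node: int) -> str:
    """Render #SBATCH directives for `replicas` workers packed onto whole nodes."""
    total_gpus = replicas * spec.gpu
    # Assumption: round up so a partially filled node is still requested.
    nodes = max(1, math.ceil(total_gpus / n_gpus_per_node))
    gpus_per_node = min(total_gpus, n_gpus_per_node)

    lines = ["#!/bin/bash", f"#SBATCH --nodes={nodes}"]
    if gpus_per_node > 0:
        # CPU-only jobs skip the GPU request entirely.
        lines.append(f"#SBATCH --gres=gpu:{gpus_per_node}")
    lines.append(f"#SBATCH --cpus-per-task={spec.cpu}")
    lines.append(f"#SBATCH --partition={spec.partition}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(sbatch_header(SpecSketch(gpu=1, cpu=4, partition="gpu"), 16, 8))
```

With 16 single-GPU replicas and 8 GPUs per node, the sketch requests two nodes with eight GPUs each.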

Node Discovery and Worker Lifecycle

Once the SLURM job starts, the scheduler uses _discover_workers areal/infra/scheduler/slurm.py448-509 to identify the allocated nodes. It queries scontrol show job and parses the reported node list with _parse_slurm_nodelist areal/infra/utils/slurm.py46-50.

[Diagram: SLURM Node Discovery Flow]

Sources: areal/infra/scheduler/slurm.py448-509 areal/infra/utils/slurm.py46-50
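
SLURM reports allocated nodes in a compact form such as node[01-03,07],gpu5. The sketch below shows one way such a string could be expanded into individual hostnames. The function name expand_nodelist is hypothetical and this is not the code of _parse_slurm_nodelist; it assumes at most one bracket group per entry, and scontrol show hostnames remains the authoritative way to expand every form SLURM can emit.

```python
import re


def expand_nodelist(nodelist: str) -> list[str]:
    """Expand a compact SLURM node list into individual hostnames.

    Simplified: supports at most one trailing [..] group per entry.
    """
    hosts: list[str] = []
    if not nodelist.strip():
        return hosts
    # Split on commas that are not inside a [...] group.
    for part in re.split(r",(?![^\[]*\])", nodelist.strip()):
        match = re.fullmatch(r"([^\[\]]+)\[([^\[\]]+)\]", part)
        if match is None:
            hosts.append(part)  # plain hostname, nothing to expand
            continue
        prefix, ranges = match.groups()
        for item in ranges.split(","):
            lo, _, hi = item.partition("-")
            hi = hi or lo       # a single index such as "07"
            width = len(lo)     # preserve zero padding, e.g. "01"
            for n in range(int(lo), int(hi) + 1):
                hosts.append(f"{prefix}{n:0{width}d}")
    return hosts


if __name__ == "__main__":
    print(expand_nodelist("node[01-03,07],gpu5"))
```

The order of the returned list matters in practice, because the first entry is typically the node that hosts the rendezvous address.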

Distributed Communication Setup

The SlurmScheduler configures environment variables to allow torch.distributed and AReaL's RPC system to initialize correctly across nodes.

Required Environment Variables

Variable	Role	Logic
`MASTER_ADDR`	Rendezvous point	IP of the first node in the allocation areal/infra/scheduler/slurm.py356
`MASTER_PORT`	Rendezvous port	Dynamically selected port areal/infra/scheduler/slurm.py357
`WORLD_SIZE`	Total process count	Total number of replicas areal/infra/scheduler/slurm.py358
`RANK`	Global process ID	`SLURM_PROCID` mapped to worker index areal/infra/scheduler/slurm.py359

The scheduler uses get_env_vars areal/infra/utils/launcher.py40-44 and get_thread_env_vars areal/infra/utils/launcher.py45-46 to build the execution environment for srun.

Sources: areal/infra/scheduler/slurm.py348-369 areal/infra/utils/launcher.py40-46
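
As an illustration of the table above, the following sketch builds the four rendezvous variables for one worker. The helper names are hypothetical and do not correspond to get_env_vars or get_thread_env_vars. It also assumes that the first entry of the node list can be used directly as the rendezvous address, whereas the scheduler resolves that node's IP.

```python
import socket


def find_free_port() -> int:
    """Ask the OS for an unused TCP port (one way to pick a port dynamically)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("", 0))  # port 0 lets the kernel choose
        return sock.getsockname()[1]


def rendezvous_env(nodes: list[str], world_size: int, rank: int,
                   master_port: int) -> dict[str, str]:
    """Environment variables that torch.distributed reads with init_method='env://'."""
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} outside world of size {world_size}")
    return {
        "MASTER_ADDR": nodes[0],         # assumption: first node hosts rendezvous
        "MASTER_PORT": str(master_port),
        "WORLD_SIZE": str(world_size),
        "RANK": str(rank),
    }


if __name__ == "__main__":
    port = find_free_port()  # chosen once, then shared by every rank
    for rank in range(2):
        print(rendezvous_env(["node01", "node02"], 2, rank, port))
```

Every rank must receive the same MASTER_ADDR and MASTER_PORT; only RANK differs between workers.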

Shared Filesystem and Storage

Distributed training on SLURM clusters requires a shared filesystem (e.g., NFS, Lustre) for weight synchronization and logging.

Validation: The scheduler calls validate_shared_path areal/utils/fs.py49 for the fileroot and nfs_record_root during initialization areal/infra/scheduler/slurm.py120-126.
Name Resolution: By default, it uses NFS-based name resolution areal/infra/scheduler/slurm.py86, which relies on file-based locking on the shared filesystem to coordinate worker IPs and ports.
Logging: Logs are directed to the shared fileroot areal/infra/scheduler/slurm.py168-176, allowing the master process to aggregate and stream logs from all workers via build_streaming_log_cmd areal/infra/utils/proc.py45-47.

Sources: areal/infra/scheduler/slurm.py120-126 areal/utils/fs.py49 areal/infra/utils/proc.py45-47
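
To make the idea of file-based name resolution concrete, here is a minimal sketch in which a worker publishes its address as a file under a shared root and a peer polls for it. The functions publish and resolve are hypothetical, and the sketch swaps the file locking described above for an atomic rename, so it should be read as an illustration of the concept rather than of AReaL's implementation.

```python
import os
import tempfile
import time
from pathlib import Path


def publish(root: Path, key: str, address: str) -> None:
    """Write `address` under root/key via a temp file and an atomic rename."""
    target = root / key
    target.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=target.parent)
    with os.fdopen(fd, "w") as handle:
        handle.write(address)
    os.replace(tmp_name, target)  # readers never observe a half-written file


def resolve(root: Path, key: str, timeout: float = 5.0, poll: float = 0.1) -> str:
    """Poll root/key until it exists, then return the published address."""
    target = root / key
    deadline = time.monotonic() + timeout
    while True:
        try:
            return target.read_text()
        except FileNotFoundError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"no address published for {key!r}")
            time.sleep(poll)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as shared:
        publish(Path(shared), "actor/0", "10.0.0.1:5000")
        print(resolve(Path(shared), "actor/0"))
```

On a real network filesystem, client-side caching can delay when a newly written file becomes visible on other nodes, which is one reason to poll with a timeout rather than read once.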

RPC Server Integration

The SLURM workers execute areal.infra.rpc.rpc_server areal/infra/rpc/rpc_server.py1-66.

Guard Architecture: Each worker runs a Flask-based RPC server with GuardState areal/infra/rpc/rpc_server.py50-53.
Blueprint Registration: The server registers data_bp for data transfer and engine_bp for model execution areal/infra/rpc/rpc_server.py54-55.
Engine Lifecycle: When create_engine is called by the scheduler areal/infra/scheduler/slurm.py552-602, the remote RPC server instantiates the requested engine (e.g., FSDPEngine) and manages its lifecycle.
Serial Execution: Engine operations are executed serially in a dedicated engine thread to ensure compatibility with collective communication backends like NCCL.

Sources: areal/infra/rpc/rpc_server.py1-66 areal/infra/scheduler/slurm.py552-602 areal/api/scheduler_api.py182-189
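
The serial execution model can be sketched with a single worker thread that drains a queue, so that calls arriving from concurrent request handlers run one at a time and always on the same thread. SerialExecutor is a hypothetical class written for this example, not the RPC server's code; concurrent.futures.ThreadPoolExecutor(max_workers=1) from the standard library gives similar behaviour.

```python
import queue
import threading
from concurrent.futures import Future


class SerialExecutor:
    """Run submitted callables one at a time on a single dedicated thread."""

    def __init__(self) -> None:
        self._tasks: queue.Queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self) -> None:
        while True:
            item = self._tasks.get()
            if item is None:  # sentinel posted by shutdown()
                break
            fn, args, kwargs, future = item
            try:
                future.set_result(fn(*args, **kwargs))
            except Exception as exc:  # hand the failure back to the caller
                future.set_exception(exc)

    def submit(self, fn, *args, **kwargs) -> Future:
        """Queue a call; the returned Future resolves once it has run."""
        future: Future = Future()
        self._tasks.put((fn, args, kwargs, future))
        return future

    def shutdown(self) -> None:
        """Stop the worker after all previously queued calls finish."""
        self._tasks.put(None)
        self._thread.join()


if __name__ == "__main__":
    executor = SerialExecutor()
    print(executor.submit(sum, [1, 2, 3]).result())
    executor.shutdown()
```

Keeping every engine call on one thread means collective operations are issued in a single, well-defined order on each worker.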


URL: https://deepwiki.com/inclusionAI/AReaL/9.6-slurm-launcher
