Last indexed: 7 May 2026 (2e12c1)

Scheduler API

Purpose and Scope

The Scheduler API defines the abstract interface for managing distributed workers and engines across different cluster backends. It provides a single abstraction for launching, coordinating, and communicating with training and inference engines, whether they run on a single node, a Ray cluster, a SLURM HPC system, or cloud infrastructure. areal/api/scheduler_api.py43-49

This page covers the abstract Scheduler interface and its core methods. For implementation details of specific backends, see:

9.4 Local Launcher for single-node execution via LocalScheduler. areal/infra/scheduler/local.py92-98
9.5 Ray Launcher for Ray-based multi-node clusters via RayScheduler. areal/infra/scheduler/ray.py58-64
9.6 SLURM Launcher for HPC cluster integration via SlurmScheduler. areal/infra/scheduler/slurm.py72-76

For worker lifecycle and job specifications, see 9.2 Worker and Job Management. For remote method invocation patterns, see 9.3 Engine RPC System.

Scheduler in System Architecture

The Scheduler sits at the core of AReaL's distributed infrastructure, acting as the bridge between high-level training orchestration and low-level resource management. It is responsible for the lifecycle of Worker processes and the instantiation of TrainEngine or InferenceEngine components on those workers. areal/api/scheduler_api.py43-49

System Component Mapping

Sources: areal/api/scheduler_api.py14-49 areal/infra/scheduler/local.py92-98 areal/infra/scheduler/ray.py58-64 areal/infra/scheduler/slurm.py72-76 areal/infra/rpc/rpc_server.py3-13 areal/infra/controller/train_controller.py22

Core Scheduler Interface

The abstract Scheduler interface defines the contract that all implementations must fulfill to enable backend-agnostic distributed execution. areal/api/scheduler_api.py43-49

Primary Methods

`create_workers(job: Job) -> list[str]`

Spawns a group of worker processes according to the Job specification. areal/api/scheduler_api.py57-82

job: A Job object specifying the role (e.g., "rollout", "actor"), number of replicas, and resource requirements. areal/api/scheduler_api.py36-41
Returns: A list of worker IDs (e.g., ["rollout/0", "rollout/1"]). areal/api/scheduler_api.py72-73
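
The example IDs above suggest a "<role>/<index>" naming scheme. The following is a minimal sketch of that scheme only, assuming zero-based indices per replica. The helper name make_worker_ids is hypothetical and is not part of the Scheduler API.

```python
def make_worker_ids(role: str, replicas: int) -> list[str]:
    """Return one ID per replica in the "<role>/<index>" form (assumed scheme)."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return [f"{role}/{index}" for index in range(replicas)]


# Two rollout replicas yield two IDs, matching the example in the text.
rollout_ids = make_worker_ids("rollout", 2)
print(rollout_ids)
```

A real backend would also launch the processes; this sketch covers only how the returned IDs relate to the role and replica count.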

`get_workers(role: str) -> list[Worker]`

Blocks until all workers for a specified role are ready and returns their metadata. areal/api/scheduler_api.py84-112

Worker Metadata: Includes id, ip, worker_ports, and engine_ports. areal/api/scheduler_api.py14-33
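
The blocking behaviour can be pictured as a readiness poll. This is a sketch under assumptions: the source states only that the call blocks until workers are ready, so the polling loop, the timeout, and the name wait_until_ready are all illustrative rather than taken from the implementation.

```python
import time
from typing import Callable


def wait_until_ready(
    is_ready: Callable[[], bool],
    timeout_s: float = 5.0,
    poll_interval_s: float = 0.01,
) -> int:
    """Poll is_ready until it returns True; return how many polls were made."""
    deadline = time.monotonic() + timeout_s
    polls = 1
    while not is_ready():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"workers not ready after {timeout_s} s")
        time.sleep(poll_interval_s)
        polls += 1
    return polls


# Simulated readiness probe: not ready twice, then ready.
pending = iter([False, False, True])
polls = wait_until_ready(lambda: next(pending))
print(polls)
```

Using a monotonic clock for the deadline keeps the timeout unaffected by wall-clock adjustments.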

`create_engine(worker_id, engine, ...)`

Instantiates a training or inference engine on a remote worker via dynamic import. This method communicates with the remote SyncRPCServer or RayRPCServer to load the class specified by the import path. areal/api/scheduler_api.py182-214

engine: A string import path (e.g., "areal.engine.fsdp_engine.FSDPPPOActor"). areal/api/scheduler_api.py199-200
engine_name: Optional unique name to allow multiple engines per worker (colocation). areal/api/scheduler_api.py201-204
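
Dynamic import from a dotted path, combined with a per-worker name registry, can be sketched with the standard library. This illustrates the general technique, not AReaL's RPC server code: load_class and EngineRegistry are hypothetical names, and standard-library classes stand in for real engines so the example runs anywhere.

```python
import importlib
from typing import Any


def load_class(import_path: str) -> type:
    """Resolve "package.module.ClassName" to the class object."""
    module_name, _, class_name = import_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'package.module.ClassName', got {import_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


class EngineRegistry:
    """Holds engines on one worker, keyed by name so several can coexist."""

    def __init__(self) -> None:
        self._engines: dict[str, Any] = {}

    def create(self, import_path: str, *args: Any,
               engine_name: str = "default", **kwargs: Any) -> Any:
        if engine_name in self._engines:
            raise KeyError(f"engine {engine_name!r} already exists on this worker")
        engine = load_class(import_path)(*args, **kwargs)
        self._engines[engine_name] = engine
        return engine

    def names(self) -> list[str]:
        return sorted(self._engines)


# Stand-in classes from the standard library play the role of engines.
registry = EngineRegistry()
first = registry.create("collections.Counter", "aab", engine_name="actor")
second = registry.create("collections.OrderedDict", engine_name="ref")
print(registry.names())
```

Keying the registry by engine_name is what allows two engines to be colocated on one worker without clashing.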

`fork_workers(role, target_role, command)`

Creates new worker processes by forking from existing workers. Forking colocates auxiliary processes (such as an OpenAI proxy) on the same nodes as existing workers, so they share the same resource environment (e.g., CUDA_VISIBLE_DEVICES). areal/api/scheduler_api.py145-180
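
The colocation idea can be illustrated with plain bookkeeping records. This is a sketch built on several assumptions: one forked process per existing worker, a "<target_role>/<index>" ID scheme, and a copied environment. WorkerRecord and fork_records are hypothetical names, and no process is actually started.

```python
from dataclasses import dataclass, field


@dataclass
class WorkerRecord:
    # Hypothetical bookkeeping record, not the Worker class from the source.
    id: str
    ip: str
    env: dict[str, str] = field(default_factory=dict)


def fork_records(existing: list[WorkerRecord], target_role: str) -> list[WorkerRecord]:
    """Create one target-role record per existing worker, on the same host."""
    return [
        WorkerRecord(id=f"{target_role}/{index}", ip=source.ip, env=dict(source.env))
        for index, source in enumerate(existing)
    ]


rollout = [
    WorkerRecord("rollout/0", "10.0.0.1", {"CUDA_VISIBLE_DEVICES": "0"}),
    WorkerRecord("rollout/1", "10.0.0.2", {"CUDA_VISIBLE_DEVICES": "1"}),
]
proxies = fork_records(rollout, "proxy")
print([(p.id, p.ip) for p in proxies])
```

Each forked record keeps the host address and device visibility of its source worker, which is the property the method description emphasises.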

`set_worker_env(worker_id, env)`

Sets environment variables on a specific worker before engine creation. Typical uses are backend-specific variables such as NCCL_IB_DISABLE or OMP_NUM_THREADS, which need to be set before the engine is created. areal/api/scheduler_api.py216-227
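
Why the ordering matters can be shown with an ordinary child process: variables merged into the environment before launch are visible to the process from its first instruction. The sketch below uses only the standard library. run_with_env is a hypothetical helper, and the real scheduler delivers variables to remote workers by its own mechanism.

```python
import os
import subprocess
import sys


def run_with_env(overrides: dict[str, str], code: str) -> str:
    """Run a Python snippet in a child process whose environment includes overrides."""
    env = {**os.environ, **overrides}
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# The child reads the variable that was set before it started.
seen = run_with_env(
    {"OMP_NUM_THREADS": "4"},
    "import os; print(os.environ['OMP_NUM_THREADS'])",
)
print(seen)
```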

Sources: areal/api/scheduler_api.py57-227 areal/infra/rpc/rpc_server.py50-56

Job and Worker Abstractions

The Scheduler manages two primary data structures: Job and Worker.

Data Flow: Job to Worker Mapping

Job

A Job represents a logical group of workers performing the same role. areal/api/scheduler_api.py36-41

role: Identifier for the group (e.g., "actor", "rollout"). areal/api/scheduler_api.py37
replicas: Number of identical worker instances to spawn. areal/api/scheduler_api.py38
tasks: A list of SchedulingSpec objects defining specific resource requirements (GPUs, memory, CPU). areal/api/scheduler_api.py39

Worker

A Worker represents a physical process or container in the cluster. areal/api/scheduler_api.py14-33

id: Unique string identifier (e.g., "rollout/0"). areal/api/scheduler_api.py29
ip: IP address for network communication. areal/api/scheduler_api.py30
worker_ports: Ports for internal worker management (e.g., RPC server port). areal/api/scheduler_api.py31
engine_ports: Ports dedicated to engine-to-engine communication (e.g., NCCL, distributed init). areal/api/scheduler_api.py32
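
The two structures can be sketched as dataclasses. The field names of Job and Worker follow the lists above. The field types, the SchedulingSpec fields (gpu, cpu, mem_gb), and the place_on_localhost helper are assumptions made for illustration and may differ from the actual definitions.

```python
from dataclasses import dataclass, field


@dataclass
class SchedulingSpec:
    # Hypothetical resource fields; the real spec may name them differently.
    gpu: int = 0
    cpu: int = 1
    mem_gb: int = 1


@dataclass
class Job:
    role: str
    replicas: int
    tasks: list[SchedulingSpec] = field(default_factory=list)


@dataclass
class Worker:
    id: str
    ip: str
    worker_ports: list[str] = field(default_factory=list)
    engine_ports: list[str] = field(default_factory=list)


def place_on_localhost(job: Job, first_port: int = 20000) -> list[Worker]:
    """Map a Job to one Worker per replica, all on the loopback address."""
    workers = []
    port = first_port
    for index in range(job.replicas):
        workers.append(
            Worker(
                id=f"{job.role}/{index}",
                ip="127.0.0.1",
                worker_ports=[str(port)],      # management port (e.g., RPC)
                engine_ports=[str(port + 1)],  # engine-to-engine port
            )
        )
        port += 2
    return workers


job = Job(role="actor", replicas=2, tasks=[SchedulingSpec(gpu=1, cpu=4, mem_gb=32)])
workers = place_on_localhost(job)
print([w.id for w in workers])
```

One Job with N replicas maps to N Worker records, each with its own identifier and its own port allocations.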

Sources: areal/api/scheduler_api.py14-41 areal/infra/scheduler/local.py203-210 areal/infra/scheduler/ray.py145-166

Shared Storage Validation

Distributed training requires a shared filesystem for weight synchronization and checkpointing. LocalScheduler and SlurmScheduler call helper functions to verify that the environment is correctly configured before starting jobs. areal/infra/scheduler/local.py138-144 areal/infra/scheduler/slurm.py120-126

The validate_shared_path function checks if a given path is located on a network filesystem (NFS, Lustre, Ceph, etc.) to prevent data inconsistency across nodes. areal/infra/scheduler/local.py49
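
One way to perform such a check on Linux is to look up the filesystem type of the mount that contains the path. The sketch below shows that technique and is not the body of validate_shared_path. The names filesystem_type and is_shared_path, and the set of network filesystem types, are assumptions. The mount table is passed in as text so the example runs on any platform.

```python
import posixpath
from typing import Optional

# Assumed set of network filesystem types; the real check may use a different list.
NETWORK_FS_TYPES = {"nfs", "nfs4", "lustre", "ceph", "gpfs", "beegfs"}


def filesystem_type(path: str, mounts_text: str) -> Optional[str]:
    """Return the type of the longest mount point that contains path."""
    target = posixpath.normpath(path).rstrip("/") + "/"
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip malformed lines
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if target.startswith(prefix) and len(prefix) > len(best_mount):
            best_mount, best_type = prefix, fs_type
    return best_type


def is_shared_path(path: str, mounts_text: str) -> bool:
    return filesystem_type(path, mounts_text) in NETWORK_FS_TYPES


# Sample mount table in /proc/mounts layout: device, mount point, type, options.
MOUNTS = """\
/dev/sda1 / ext4 rw 0 0
server:/export /mnt/shared nfs4 rw 0 0
"""
print(is_shared_path("/mnt/shared/experiments", MOUNTS))
print(is_shared_path("/tmp/run", MOUNTS))
```

On a Linux host the mount table text could come from reading /proc/mounts. Picking the longest matching mount point ensures a path under a nested mount is judged by that mount rather than by the root filesystem.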

Scheduler	Configuration Parameter	Validation Logic
LocalScheduler	`fileroot`, `name_resolve_config.nfs_record_root`	Checks path exists and is shared areal/infra/scheduler/local.py138-144
SlurmScheduler	`fileroot`, `name_resolve_config.nfs_record_root`	Checks path exists and is shared areal/infra/scheduler/slurm.py120-126
RayScheduler	`exp_config.cluster.fileroot`	Implicitly handled by Ray environment areal/infra/scheduler/ray.py66-72

Sources: areal/infra/scheduler/local.py49 areal/infra/scheduler/local.py138-144 areal/infra/scheduler/slurm.py120-126 areal/infra/scheduler/ray.py66-72

Relationship to Orchestration

Orchestrators consume the Scheduler API to build the distributed training environment. The typical flow involves creating workers, setting their environment, and then instantiating engines via RPC. areal/api/scheduler_api.py182-227
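
The flow described above can be traced with an in-memory stand-in. This is a sketch under assumptions: RecordingScheduler and bring_up are hypothetical, the method signatures are simplified (the real create_workers takes a Job), and nothing is launched or imported. The engine path is only recorded as a string.

```python
class RecordingScheduler:
    """In-memory stand-in that records the order of calls; it launches nothing."""

    def __init__(self) -> None:
        self.calls: list[str] = []
        self._workers: dict[str, list[str]] = {}

    def create_workers(self, role: str, replicas: int) -> list[str]:
        ids = [f"{role}/{index}" for index in range(replicas)]
        self._workers[role] = ids
        self.calls.append(f"create_workers:{role}")
        return ids

    def get_workers(self, role: str) -> list[str]:
        self.calls.append(f"get_workers:{role}")
        return list(self._workers[role])

    def set_worker_env(self, worker_id: str, env: dict[str, str]) -> None:
        self.calls.append(f"set_worker_env:{worker_id}")

    def create_engine(self, worker_id: str, engine: str) -> None:
        self.calls.append(f"create_engine:{worker_id}")


def bring_up(scheduler: RecordingScheduler, role: str, replicas: int,
             env: dict[str, str], engine: str) -> list[str]:
    """Create workers, wait for them, set their environment, then create engines."""
    scheduler.create_workers(role, replicas)
    worker_ids = scheduler.get_workers(role)
    for worker_id in worker_ids:
        scheduler.set_worker_env(worker_id, env)   # must precede engine creation
        scheduler.create_engine(worker_id, engine)
    return worker_ids


scheduler = RecordingScheduler()
bring_up(
    scheduler,
    role="actor",
    replicas=2,
    env={"OMP_NUM_THREADS": "4"},
    engine="areal.engine.fsdp_engine.FSDPPPOActor",
)
print(scheduler.calls)
```

Because the orchestration logic depends only on the four method names, the same bring_up function could in principle drive any backend that implements them.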

Sources: areal/api/scheduler_api.py182-227 areal/infra/rpc/rpc_server.py50-62 areal/infra/scheduler/ray.py104-129 areal/infra/controller/train_controller.py189-209

URL: https://deepwiki.com/inclusionAI/AReaL/9.1-scheduler-api
