
URL: https://deepwiki.com/inclusionAI/AReaL/9.1-scheduler-api

Scheduler API | inclusionAI/AReaL | DeepWiki


Last indexed: 7 May 2026 (2e12c1)

Scheduler API

Purpose and Scope

The Scheduler API defines the abstract interface for managing distributed workers and engines across different cluster backends. It provides a unified abstraction for launching, coordinating, and communicating with training and inference engines regardless of whether they run on a single node, Ray cluster, SLURM HPC system, or cloud infrastructure. areal/api/scheduler_api.py43-49

This page covers the abstract Scheduler interface and its core methods. Implementation details of specific backends are covered on their own pages.

For worker lifecycle and job specifications, see 9.2 Worker and Job Management. For remote method invocation patterns, see 9.3 Engine RPC System.

Scheduler in System Architecture

The Scheduler sits at the core of AReaL's distributed infrastructure, acting as the bridge between high-level training orchestration and low-level resource management. It is responsible for the lifecycle of Worker processes and the instantiation of TrainEngine or InferenceEngine components on those workers. areal/api/scheduler_api.py43-49

[Diagram: System Component Mapping]


Sources: areal/api/scheduler_api.py14-49 areal/infra/scheduler/local.py92-98 areal/infra/scheduler/ray.py58-64 areal/infra/scheduler/slurm.py72-76 areal/infra/rpc/rpc_server.py3-13 areal/infra/controller/train_controller.py22

Core Scheduler Interface

The abstract Scheduler interface defines the contract that all implementations must fulfill to enable backend-agnostic distributed execution. areal/api/scheduler_api.py43-49

Primary Methods

create_workers(job: Job) -> list[str]

Spawns a group of worker processes according to the Job specification. areal/api/scheduler_api.py57-82

get_workers(role: str) -> list[Worker]

Blocks until all workers for a specified role are ready and returns their metadata. areal/api/scheduler_api.py84-112
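
The blocking behavior described above can be pictured as a poll loop with a deadline. The following is a minimal sketch, not the AReaL implementation: `wait_until_ready`, its `poll` callback, and the `timeout` and `interval` parameters are hypothetical names chosen for illustration.

```python
import time
from typing import Callable

def wait_until_ready(
    poll: Callable[[], list[bool]],
    timeout: float = 5.0,
    interval: float = 0.01,
) -> int:
    """Block until every worker reports ready, or raise TimeoutError.

    `poll` is a hypothetical callback returning one readiness flag per worker.
    """
    deadline = time.monotonic() + timeout
    while True:
        states = poll()
        if states and all(states):
            return len(states)  # number of ready workers
        if time.monotonic() >= deadline:
            raise TimeoutError(f"workers not ready after {timeout} s")
        time.sleep(interval)

# Simulated role with two workers; the second becomes ready on the third poll.
calls = {"n": 0}

def fake_poll() -> list[bool]:
    calls["n"] += 1
    return [True, calls["n"] >= 3]

ready = wait_until_ready(fake_poll)
```

A real backend would replace `fake_poll` with a health check against each worker, but the deadline-and-retry structure would likely stay the same.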

create_engine(worker_id, engine, ...)

Instantiates a training or inference engine on a remote worker via dynamic import. This method communicates with the remote SyncRPCServer or RayRPCServer to load the class specified by the import path. areal/api/scheduler_api.py182-214
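
Dynamic import by path is a standard Python technique. The sketch below shows the general idea using only `importlib`. The helpers `load_class` and `create_engine_local` are hypothetical, the example runs in a single process rather than over RPC, and `collections.OrderedDict` stands in for a real engine class.

```python
import importlib

def load_class(import_path: str):
    """Resolve a dotted path such as 'package.module.ClassName' to a class."""
    module_name, _, class_name = import_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.Class', got {import_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)

def create_engine_local(import_path: str, *args, **kwargs):
    """Instantiate the class named by import_path with the given arguments."""
    cls = load_class(import_path)
    return cls(*args, **kwargs)

# Stand-in "engine": any importable class works for the demonstration.
engine = create_engine_local("collections.OrderedDict", [("lr", 1e-5)])
```

Passing the import path as a string keeps the caller free of a direct dependency on the engine module, which is why this pattern suits remote instantiation.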

fork_workers(role, target_role, command)

Creates new worker processes by forking from existing workers. This is used to colocate processes (like an OpenAI proxy) on the same nodes as existing workers, sharing the same resource environment (e.g., CUDA_VISIBLE_DEVICES). areal/api/scheduler_api.py145-180
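
As a rough illustration of environment sharing, the sketch below launches a child process that sees the same `CUDA_VISIBLE_DEVICES` value as its parent worker. This is an assumption-laden stand-in: `spawn_colocated` is a hypothetical helper built on `subprocess`, and it does not reproduce how any AReaL backend actually forks workers.

```python
import os
import subprocess
import sys

def spawn_colocated(parent_env: dict[str, str], command: list[str]) -> str:
    """Run `command` with the parent worker's variables layered over os.environ."""
    env = dict(os.environ)
    env.update(parent_env)  # the child sees the same resource environment
    result = subprocess.run(
        command, env=env, capture_output=True, text=True, check=True
    )
    return result.stdout.strip()

# Hypothetical parent worker pinned to GPUs 2 and 3.
parent_env = {"CUDA_VISIBLE_DEVICES": "2,3"}
probe = [
    sys.executable,
    "-c",
    "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])",
]
seen = spawn_colocated(parent_env, probe)
```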

set_worker_env(worker_id, env)

Sets environment variables on a specific worker before engine creation. This is critical for setting backend-specific variables like NCCL_IB_DISABLE or OMP_NUM_THREADS. areal/api/scheduler_api.py216-227

Sources: areal/api/scheduler_api.py57-227 areal/infra/rpc/rpc_server.py50-56
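
The five methods above can be summarized as an abstract base class. This is a sketch under stated assumptions: the class is named `SchedulerSketch` to make clear it is not the library's definition, `Job` and `Worker` are empty placeholders, and the parameter types and the return types of `create_engine`, `fork_workers`, and `set_worker_env` are guesses, since the page only gives return types for `create_workers` and `get_workers`.

```python
from abc import ABC, abstractmethod
from typing import Any

class Job:
    """Placeholder for the real Job specification."""

class Worker:
    """Placeholder for the real Worker metadata."""

class SchedulerSketch(ABC):
    @abstractmethod
    def create_workers(self, job: Job) -> list[str]:
        """Spawn workers for a job and return their identifiers."""

    @abstractmethod
    def get_workers(self, role: str) -> list[Worker]:
        """Block until all workers of a role are ready."""

    @abstractmethod
    def create_engine(self, worker_id: str, engine: str, *args: Any, **kwargs: Any) -> Any:
        """Instantiate the engine named by an import path on a worker."""

    @abstractmethod
    def fork_workers(self, role: str, target_role: str, command: str) -> list[str]:
        """Create colocated workers from existing ones (return type assumed)."""

    @abstractmethod
    def set_worker_env(self, worker_id: str, env: dict[str, str]) -> None:
        """Set environment variables on a worker before engine creation."""
```

Because every method is abstract, `SchedulerSketch()` raises `TypeError`; a backend becomes instantiable only once it overrides all five.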

Job and Worker Abstractions

The Scheduler manages two primary data structures: Job and Worker.

[Diagram: Data Flow, Job to Worker Mapping]


Job

A Job represents a logical group of workers performing the same role. areal/api/scheduler_api.py36-41

Worker

A Worker represents a physical process or container in the cluster. areal/api/scheduler_api.py14-33

Sources: areal/api/scheduler_api.py14-41 areal/infra/scheduler/local.py203-210 areal/infra/scheduler/ray.py145-166
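
To make the one-to-many relationship concrete, the sketch below expands a job into per-process worker records. All names here are hypothetical: the fields of `JobSketch` and `WorkerSketch`, the `role/index` identifier scheme, and the port assignment are illustrative assumptions, not the fields defined in areal/api/scheduler_api.py.

```python
from dataclasses import dataclass, field

@dataclass
class JobSketch:
    role: str       # logical role shared by all workers in the group
    replicas: int   # number of worker processes to create

@dataclass
class WorkerSketch:
    id: str                                    # assumed "<role>/<index>" scheme
    ip: str
    ports: list[int] = field(default_factory=list)

def expand_job(job: JobSketch, ip: str = "127.0.0.1", base_port: int = 9000) -> list[WorkerSketch]:
    """Map one logical job onto one worker record per replica."""
    return [
        WorkerSketch(id=f"{job.role}/{i}", ip=ip, ports=[base_port + i])
        for i in range(job.replicas)
    ]

workers = expand_job(JobSketch(role="rollout", replicas=3))
```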

Shared Storage Validation

Distributed training requires a shared filesystem for weight synchronization and checkpointing. Schedulers such as LocalScheduler and SlurmScheduler call helper functions to verify that the environment is correctly configured before starting jobs. areal/infra/scheduler/local.py138-144 areal/infra/scheduler/slurm.py120-126

The validate_shared_path function checks if a given path is located on a network filesystem (NFS, Lustre, Ceph, etc.) to prevent data inconsistency across nodes. areal/infra/scheduler/local.py49
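
One common way to perform this kind of check on Linux is to look up the filesystem type of the mount that contains the path. The sketch below illustrates that approach on `/proc/mounts`-style text. It is not the body of `validate_shared_path`: the helper names, the set of filesystem types, and the sample mount table are all assumptions made for illustration.

```python
from pathlib import PurePosixPath
from typing import Optional

# Assumed list of network filesystem type names; not taken from AReaL.
NETWORK_FS_TYPES = {"nfs", "nfs4", "lustre", "ceph", "cifs", "gpfs", "beegfs"}

def parse_mounts(mounts_text: str) -> list[tuple[str, str]]:
    """Return (mount_point, fs_type) pairs from /proc/mounts-style text."""
    entries = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            entries.append((fields[1], fields[2]))
    return entries

def fs_type_for_path(path: str, mounts_text: str) -> Optional[str]:
    """Find the filesystem type of the deepest mount point containing path."""
    target = PurePosixPath(path)
    best = None
    for mount_point, fs_type in parse_mounts(mounts_text):
        mp = PurePosixPath(mount_point)
        if target == mp or mp in target.parents:
            # Prefer deeper mounts; on ties the later entry wins.
            if best is None or len(mp.parts) >= len(best[0].parts):
                best = (mp, fs_type)
    return best[1] if best else None

def is_on_network_fs(path: str, mounts_text: str) -> bool:
    return fs_type_for_path(path, mounts_text) in NETWORK_FS_TYPES

SAMPLE_MOUNTS = """\
/dev/sda1 / ext4 rw 0 0
server:/export /mnt/shared nfs4 rw 0 0
"""

shared = is_on_network_fs("/mnt/shared/exp/ckpt", SAMPLE_MOUNTS)
local = is_on_network_fs("/tmp/exp", SAMPLE_MOUNTS)
```

On a real node the mount table would be read from `/proc/mounts` instead of the sample string.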

| Scheduler | Configuration Parameter | Validation Logic |
| --- | --- | --- |
| LocalScheduler | fileroot, name_resolve_config.nfs_record_root | Checks path exists and is shared areal/infra/scheduler/local.py138-144 |
| SlurmScheduler | fileroot, name_resolve_config.nfs_record_root | Checks path exists and is shared areal/infra/scheduler/slurm.py120-126 |
| RayScheduler | exp_config.cluster.fileroot | Implicitly handled by Ray environment areal/infra/scheduler/ray.py66-72 |

Sources: areal/infra/scheduler/local.py49 areal/infra/scheduler/local.py138-144 areal/infra/scheduler/slurm.py120-126 areal/infra/scheduler/ray.py66-72

Relationship to Orchestration

Orchestrators consume the Scheduler API to build the distributed training environment. The typical flow involves creating workers, setting their environment, and then instantiating engines via RPC. areal/api/scheduler_api.py182-227
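
The flow described above (create workers, set their environment, then instantiate engines) can be sketched with an in-memory stand-in. Everything in this example is hypothetical: `InMemoryScheduler` and `bootstrap` are illustrative names, `create_workers` takes a role and replica count directly instead of a Job, no RPC is involved, and `collections.Counter` stands in for an engine class.

```python
import importlib
from typing import Any

class InMemoryScheduler:
    """Toy scheduler that records call order; not an AReaL backend."""

    def __init__(self) -> None:
        self.env: dict[str, dict[str, str]] = {}
        self.engines: dict[str, Any] = {}
        self.calls: list[str] = []

    def create_workers(self, role: str, replicas: int) -> list[str]:
        ids = [f"{role}/{i}" for i in range(replicas)]
        for worker_id in ids:
            self.env[worker_id] = {}
        self.calls.append("create_workers")
        return ids

    def set_worker_env(self, worker_id: str, env: dict[str, str]) -> None:
        # Enforce the ordering from the text: environment first, engine second.
        if worker_id in self.engines:
            raise RuntimeError("environment must be set before engine creation")
        self.env[worker_id].update(env)
        self.calls.append("set_worker_env")

    def create_engine(self, worker_id: str, engine: str, *args: Any, **kwargs: Any) -> Any:
        module_name, _, class_name = engine.rpartition(".")
        cls = getattr(importlib.import_module(module_name), class_name)
        self.engines[worker_id] = cls(*args, **kwargs)
        self.calls.append("create_engine")
        return self.engines[worker_id]

def bootstrap(scheduler: InMemoryScheduler, role: str, replicas: int,
              env: dict[str, str], engine_path: str) -> dict[str, Any]:
    """Run the three-step flow and return the engine created on each worker."""
    worker_ids = scheduler.create_workers(role, replicas)
    for worker_id in worker_ids:
        scheduler.set_worker_env(worker_id, env)
    return {wid: scheduler.create_engine(wid, engine_path) for wid in worker_ids}

sched = InMemoryScheduler()
engines = bootstrap(sched, "actor", 2, {"OMP_NUM_THREADS": "1"}, "collections.Counter")
```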


Sources: areal/api/scheduler_api.py182-227 areal/infra/rpc/rpc_server.py50-62 areal/infra/scheduler/ray.py104-129 areal/infra/controller/train_controller.py189-209