Last indexed: 7 May 2026 (2e12c1)

Scheduler and Job Management

Purpose and Scope

This document describes AReaL's scheduler abstraction and distributed job management system. The scheduler is responsible for provisioning worker processes across execution environments (local machines, Ray clusters, Slurm HPC systems), creating and managing engine instances on those workers, and facilitating remote procedure calls (RPC) between the control plane and data plane.

The scheduler layer decouples AReaL's high-level training orchestration from the underlying execution environment, enabling the same training code to run on a laptop, a Ray cluster, or an HPC system with minimal configuration changes. For configuration of resource allocation patterns, see allocation_mode Syntax. For details on training and inference engines that run on workers, see Training Engines and Inference System.

Sources: areal/api/scheduler_api.py:43-49, areal/api/scheduler_api.py:51-220

Scheduler Architecture Overview

The scheduler follows a control plane / data plane separation. The control plane runs on the driver process and manages job orchestration, while the data plane consists of worker processes that host engine instances performing actual computation. A central component of the data plane is the Guard, which handles process management and port allocation.

System Components to Code Entities

The following diagram maps high-level system components to their implementations and identifiers in the codebase, highlighting the role of the Guard and SyncRPCServer.

Sources: areal/api/scheduler_api.py:43-49, areal/infra/rpc/rpc_server.py:33-62, areal/infra/rpc/guard/app.py:39-109, areal/infra/rpc/guard/engine_blueprint.py:92-106, areal/infra/controller/train_controller.py:150-153, areal/infra/controller/rollout_controller.py:185-190

Worker Lifecycle and RPC Mapping

The scheduler interacts with workers through standardized API calls that map to internal RPC server routes. The SyncRPCServer runs all engine operations on a single internal thread, _engine_thread, so that they execute serially, which is critical for NCCL stability.

Sources: areal/api/scheduler_api.py:51-210, areal/infra/rpc/guard/app.py:154-180, areal/infra/rpc/guard/engine_blueprint.py:101-168, areal/infra/rpc/serialization.py:1-30
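
The serial-execution idea can be illustrated with a minimal sketch. It assumes nothing about SyncRPCServer beyond what is stated above: request handlers may arrive on many threads, but every engine call is funneled through one dedicated thread. The names `run_on_engine_thread` and `ToyEngine` are hypothetical and exist only for this example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the server's engine thread: a pool with exactly
# one worker runs submitted calls one at a time, in submission order.
_engine_executor = ThreadPoolExecutor(max_workers=1, thread_name_prefix="engine")


def run_on_engine_thread(fn, *args, **kwargs):
    """Run fn on the single engine thread and block until it returns."""
    return _engine_executor.submit(fn, *args, **kwargs).result()


class ToyEngine:
    """Illustrative engine that records which thread executed each call."""

    def __init__(self):
        self.calls = []

    def step(self, batch_id):
        self.calls.append((batch_id, threading.current_thread().name))
        return batch_id * 2


engine = ToyEngine()

# Simulate several request-handler threads hitting the server at once.
handlers = [
    threading.Thread(target=run_on_engine_thread, args=(engine.step, i))
    for i in range(8)
]
for t in handlers:
    t.start()
for t in handlers:
    t.join()

# Every call ran on the same thread, regardless of which handler issued it.
engine_threads = {name for _, name in engine.calls}
```

Because only one thread ever touches the engine, operations that must not interleave (such as collective communication calls) cannot overlap, at the cost of handlers blocking while they wait their turn.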

Core Data Structures

Worker

The Worker dataclass represents a worker process in the distributed system. Each worker has a unique identifier, IP address, and allocated port numbers for communication.

| Field | Type | Description |
| --- | --- | --- |
| `id` | `str` | Unique identifier (e.g., `"rollout/0"`, `"actor/1"`) |
| `ip` | `str` | IP address where worker is running |
| `worker_ports` | `list[str]` | Port numbers for worker-level communication |
| `engine_ports` | `list[str]` | Port numbers for engine-level communication |

Sources: areal/api/scheduler_api.py:14-32
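
As a rough sketch, the fields in the table could be expressed as the dataclass below. The field names and types follow the table; the default values and the `worker_address` helper are assumptions made for illustration and may not match the actual definition.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Sketch of the Worker record, mirroring the documented fields."""

    id: str  # e.g. "rollout/0" or "actor/1"
    ip: str  # address where the worker process runs
    # Empty-list defaults are an assumption made for this example.
    worker_ports: list[str] = field(default_factory=list)
    engine_ports: list[str] = field(default_factory=list)


def worker_address(worker: Worker, port_index: int = 0) -> str:
    """Hypothetical helper: build a host:port string from a worker port."""
    return f"{worker.ip}:{worker.worker_ports[port_index]}"


w = Worker(id="rollout/0", ip="10.0.0.5", worker_ports=["8000", "8001"])
address = worker_address(w)
```

Note that ports are typed as strings in the table, so they can be joined into an address without conversion.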

Job

The Job dataclass specifies how to create a group of workers for a specific role.

| Field | Type | Description |
| --- | --- | --- |
| `role` | `str` | Role name (e.g., `"rollout"`, `"actor"`) |
| `replicas` | `int` | Number of worker replicas to create |
| `tasks` | `list[SchedulingSpec]` | Resource requirements (GPU, CPU, Memory) per task |
| `scheduling_strategy` | `SchedulingStrategy` | Placement strategy (e.g., colocation) |

Sources: areal/api/scheduler_api.py:35-40
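
A job specification along these lines might look like the sketch below. Only the four `Job` field names come from the table. The contents of `SchedulingSpec` and `SchedulingStrategy` are not described on this page, so their fields here (`gpu`, `cpu`, `mem_gb`, `kind`, `target`) are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SchedulingSpec:
    """Hypothetical per-task resource request; field names are assumed."""

    gpu: int = 0
    cpu: int = 1
    mem_gb: int = 1


@dataclass
class SchedulingStrategy:
    """Hypothetical placement hint; field names are assumed."""

    kind: str = "separation"
    target: Optional[str] = None  # role to colocate with, if any


@dataclass
class Job:
    """Sketch of the Job record, mirroring the documented fields."""

    role: str
    replicas: int
    tasks: list[SchedulingSpec] = field(default_factory=list)
    scheduling_strategy: SchedulingStrategy = field(default_factory=SchedulingStrategy)


def total_gpus(job: Job) -> int:
    """Sum the GPUs requested across all listed tasks of a job."""
    return sum(task.gpu for task in job.tasks)


job = Job(
    role="actor",
    replicas=2,
    tasks=[SchedulingSpec(gpu=1, cpu=4, mem_gb=32) for _ in range(2)],
)
# Worker ids in the "role/index" form used by the Worker examples.
expected_ids = [f"{job.role}/{i}" for i in range(job.replicas)]
```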

Scheduler Implementations

The Scheduler abstract base class defines the contract for managing distributed worker processes.
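
To show how an abstract contract keeps orchestration code independent of the backend, here is a minimal sketch. The method names (`create_workers`, `get_workers`, `delete_workers`) and their signatures are assumptions chosen for illustration, and `InMemoryScheduler` is a toy backend, not one of the implementations described below.

```python
from abc import ABC, abstractmethod


class SchedulerSketch(ABC):
    """Illustrative contract; method names are assumptions for this example."""

    @abstractmethod
    def create_workers(self, role: str, replicas: int) -> list[str]:
        """Provision workers for a role and return their ids."""

    @abstractmethod
    def get_workers(self, role: str) -> list[str]:
        """Return ids of workers currently provisioned for a role."""

    @abstractmethod
    def delete_workers(self, role: str) -> None:
        """Tear down all workers for a role."""


class InMemoryScheduler(SchedulerSketch):
    """Toy backend that only tracks ids, standing in for a real launcher."""

    def __init__(self):
        self._workers: dict[str, list[str]] = {}

    def create_workers(self, role, replicas):
        if role in self._workers:
            raise ValueError(f"workers for role {role!r} already exist")
        self._workers[role] = [f"{role}/{i}" for i in range(replicas)]
        return list(self._workers[role])

    def get_workers(self, role):
        return list(self._workers.get(role, []))

    def delete_workers(self, role):
        self._workers.pop(role, None)


def launch(scheduler: SchedulerSketch) -> list[str]:
    """Orchestration code depends only on the abstract interface."""
    scheduler.create_workers("rollout", 2)
    return scheduler.get_workers("rollout")
```

Swapping the backend then means passing a different subclass to `launch`, with no change to the orchestration code itself.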

Local Launcher

LocalScheduler manages worker subprocesses on a single GPU node. It handles round-robin GPU assignment via _allocate_gpus and monitors process health. It validates shared paths for fileroot and name_resolve before starting. For details, see Local Launcher.

Sources: areal/infra/scheduler/local.py:92-171, areal/infra/scheduler/local.py:199-215
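
Round-robin GPU assignment can be sketched as follows. This is not the body of _allocate_gpus, whose details are not shown here; the function name, parameters, and wrap-around behavior are assumptions that illustrate the general technique.

```python
def allocate_gpus_round_robin(num_workers, gpus_per_worker, visible_gpus):
    """Assign device ids to workers in order, wrapping around when exhausted."""
    if not visible_gpus:
        raise ValueError("no visible GPUs to allocate")
    if gpus_per_worker > len(visible_gpus):
        raise ValueError("a worker requests more GPUs than are visible")

    assignments = []
    cursor = 0  # index of the next device to hand out
    for _ in range(num_workers):
        devices = [
            visible_gpus[(cursor + k) % len(visible_gpus)]
            for k in range(gpus_per_worker)
        ]
        cursor = (cursor + gpus_per_worker) % len(visible_gpus)
        assignments.append(devices)
    return assignments


def cuda_visible_devices(devices):
    """Format a device list for the CUDA_VISIBLE_DEVICES environment variable."""
    return ",".join(str(d) for d in devices)


# Three single-GPU workers on a two-GPU node: the third wraps back to GPU 0.
shared = allocate_gpus_round_robin(3, 1, [0, 1])
# Two workers with two GPUs each on a four-GPU node: no sharing needed.
exclusive = allocate_gpus_round_robin(2, 2, [0, 1, 2, 3])
```

A launcher would typically pass each worker's device list to its subprocess through `CUDA_VISIBLE_DEVICES`, so the worker sees only its assigned devices.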

Ray Launcher

RayScheduler uses Ray placement groups to manage resources across a cluster. It supports different placement strategies (Shared, Separated, Deferred) via _get_placement_strategy and uses RayRPCServer to host engines within Ray actors. For details, see Ray Launcher.

Sources: areal/infra/scheduler/ray.py:58-71, areal/infra/scheduler/ray.py:145-166, areal/infra/scheduler/ray.py:167-200

SLURM Launcher

SlurmScheduler integrates with HPC clusters using sbatch and srun. It tracks job states using query_jobs and handles worker discovery via name_resolve. It supports containerized execution via Apptainer or Docker. For details, see SLURM Launcher.

Sources: areal/infra/scheduler/slurm.py:72-156, areal/infra/scheduler/slurm.py:135-140
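
How query_jobs gathers state is not shown on this page. One common approach, sketched here purely as an assumption, is to poll `squeue` and parse its output. The helper names below are hypothetical; the `squeue` options used (`--noheader`, `--format` with `%i` for job id and `%T` for the long-form state, and `--jobs`) are standard Slurm options.

```python
def squeue_command(job_ids):
    """Build a squeue invocation printing '<job id> <state>' per line."""
    return [
        "squeue",
        "--noheader",
        "--format=%i %T",
        "--jobs",
        ",".join(str(j) for j in job_ids),
    ]


def parse_squeue(output: str) -> dict[str, str]:
    """Parse '<job id> <state>' lines into a mapping, skipping malformed lines."""
    states = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        job_id, state = parts
        states[job_id] = state
    return states


def all_running(states: dict[str, str], job_ids) -> bool:
    """True only if every requested job is listed and in the RUNNING state."""
    return all(states.get(str(j)) == "RUNNING" for j in job_ids)


# Sample text in the shape produced by the command above.
sample = "1234 RUNNING\n1235 PENDING\n"
states = parse_squeue(sample)
```

In practice the command would be executed with `subprocess.run(squeue_command(ids), capture_output=True, text=True)` and its standard output passed to `parse_squeue`. Jobs that have already finished may no longer appear in `squeue` output, which is why `all_running` treats a missing id as not running.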

Engine RPC System

AReaL uses a custom RPC system built on Flask/Werkzeug to execute methods on remote engines. This is managed by the Guard and associated blueprints.

The Guard Process

The Guard is a base process management layer. It maintains a GuardState that tracks allocated ports and forked child processes.

GuardState: Thread-safe storage for server identity, worker identity, and process handles (areal/infra/rpc/guard/app.py:39-84).
create_app: Factory for the Flask application providing /health, /alloc_ports, and /fork endpoints (areal/infra/rpc/guard/app.py:154-180).
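
The port-tracking side of such a state object can be sketched with the standard library alone. The class below is a hypothetical illustration, not the actual GuardState: it assumes ports are discovered by binding to port 0 and letting the operating system choose, and it uses a lock so concurrent requests never receive the same port.

```python
import socket
import threading


class GuardStateSketch:
    """Hypothetical thread-safe record of ports handed out so far."""

    def __init__(self):
        self._lock = threading.Lock()
        self.allocated_ports: set[int] = set()

    def alloc_ports(self, count: int) -> list[int]:
        """Return `count` free ports that were not handed out before."""
        ports: list[int] = []
        probes: list[socket.socket] = []
        with self._lock:
            try:
                while len(ports) < count:
                    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
                    # Port 0 asks the OS to pick any currently free port.
                    sock.bind(("127.0.0.1", 0))
                    # Keep the probe open so the OS cannot offer it again
                    # within this batch.
                    probes.append(sock)
                    port = sock.getsockname()[1]
                    if port not in self.allocated_ports:
                        ports.append(port)
            finally:
                for sock in probes:
                    sock.close()
            self.allocated_ports.update(ports)
        return ports


state = GuardStateSketch()
first = state.alloc_ports(3)
second = state.alloc_ports(2)
```

This approach has an inherent race: once the probe sockets are closed, another process on the machine could claim a port before the intended worker binds it. Tracking `allocated_ports` only prevents duplicates among requests served by the same state object.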

Engine Creation and Execution

The engine_bp blueprint extends the Guard to support engine lifecycles:

create_engine: Dynamically imports and instantiates engine classes on the worker (areal/infra/rpc/guard/engine_blueprint.py:56-61).
call: Routes remote method calls to the local engine instance via the _engine_thread (areal/infra/rpc/guard/engine_blueprint.py:63-69).

Sources: areal/infra/rpc/rpc_server.py:33-62, areal/infra/rpc/guard/app.py:19-25, areal/infra/rpc/guard/engine_blueprint.py:92-106
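
The create-then-call pattern can be sketched with `importlib`. This is an illustration of the technique only: `import_from_path` and `EngineRegistry` are hypothetical names, the request format is assumed, and the sketch omits the HTTP layer and the engine thread described above.

```python
import importlib


def import_from_path(path: str):
    """Resolve a dotted path such as 'package.module.ClassName' to an object."""
    module_name, _, attr = path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected a dotted path, got {path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


class EngineRegistry:
    """Hypothetical holder for engine instances created on a worker."""

    def __init__(self):
        self._engines = {}

    def create_engine(self, name, class_path, *args, **kwargs):
        """Import the class named by class_path and instantiate it."""
        cls = import_from_path(class_path)
        self._engines[name] = cls(*args, **kwargs)
        return name

    def call(self, name, method, *args, **kwargs):
        """Look up a method by name on a stored engine and invoke it."""
        return getattr(self._engines[name], method)(*args, **kwargs)


registry = EngineRegistry()
# collections.Counter stands in for an engine class in this example.
registry.create_engine("demo", "collections.Counter", "aab")
result = registry.call("demo", "most_common", 1)
```

Resolving classes by dotted path lets the control plane choose the engine type at runtime by sending a string, without the worker importing every engine up front.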

Shared Storage Validation

Distributed training requires all nodes to access the same checkpoints and logs. The validate_shared_path utility ensures that the provided paths (like fileroot) are on a recognized network filesystem to prevent data inconsistency.

Network Filesystem Detection

The system checks psutil.disk_partitions() against a list of known keywords defined in NETWORK_FS_KEYWORDS:

Standard: nfs, lustre, gpfs, beegfs, ceph, glusterfs
Cloud: efs (AWS), gcsfuse (GCP), cpfs (Alibaba), vepfs (ByteDance), obsfs (Huawei)
Distributed: juicefs, alluxio, seaweedfs, hdfs

Sources: areal/utils/fs.py:10-48, areal/utils/fs.py:58-111
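
The detection logic can be approximated by the sketch below. The keyword list is taken from the text above; everything else is an assumption, including the choice to match keywords as substrings of the filesystem type and to resolve a path to its longest matching mount point. The helper names are hypothetical, and partitions are passed in as plain tuples so the example runs without psutil.

```python
import os

# Keywords as listed above; the real constant may be organized differently.
NETWORK_FS_KEYWORDS = (
    "nfs", "lustre", "gpfs", "beegfs", "ceph", "glusterfs",
    "efs", "gcsfuse", "cpfs", "vepfs", "obsfs",
    "juicefs", "alluxio", "seaweedfs", "hdfs",
)


def is_network_fstype(fstype: str) -> bool:
    """Assumed rule: any keyword appearing in the lowercased fstype matches."""
    fstype = fstype.lower()
    return any(keyword in fstype for keyword in NETWORK_FS_KEYWORDS)


def fstype_for_path(path: str, partitions):
    """Return the fstype of the longest mount point containing path."""
    path = os.path.abspath(path)
    best = None
    for mountpoint, fstype in partitions:
        mp = mountpoint.rstrip("/") or "/"
        # Compare on a directory boundary so /mnt/shared does not
        # match /mnt/sharedx.
        if path == mp or path.startswith(mp.rstrip("/") + "/"):
            if best is None or len(mp) > len(best[0]):
                best = (mp, fstype)
    return best[1] if best else None


# Illustrative (mountpoint, fstype) pairs, not real system output.
partitions = [
    ("/", "ext4"),
    ("/mnt/shared", "nfs4"),
    ("/data", "fuse.juicefs"),
]
shared_ok = is_network_fstype(fstype_for_path("/mnt/shared/ckpt", partitions))
local_ok = is_network_fstype(fstype_for_path("/home/user/out", partitions))
```

With psutil installed, the tuple list could be built as `[(p.mountpoint, p.fstype) for p in psutil.disk_partitions(all=True)]`.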

Summary of Child Pages

Scheduler API: Defines the Worker, Job, and Scheduler base interfaces (areal/api/scheduler_api.py:1-220).
Worker and Job Management: Details on SchedulingSpec, SchedulingStrategy, and the worker lifecycle (areal/api/scheduler_api.py:14-40).
Engine RPC System: Deep dive into rpc_server.py, the Guard, and the RPC protocol (areal/infra/rpc/guard/app.py:39-109).
Local Launcher: Implementation details for LocalScheduler and subprocess management (areal/infra/scheduler/local.py:92-171).
Ray Launcher: Integration with Ray placement groups and RayRPCServer (areal/infra/scheduler/ray.py:58-192).
SLURM Launcher: Managing HPC jobs with sbatch templates and srun (areal/infra/scheduler/slurm.py:72-156).
SkyPilot Integration: Cloud-agnostic deployment using SkyPilot for automated cluster provisioning.
Shared Storage Validation: Filesystem checks for distributed consistency using validate_shared_path (areal/utils/fs.py:58-111).

Sources: areal/api/scheduler_api.py:1-251, areal/infra/scheduler/local.py:1-206, areal/infra/scheduler/slurm.py:1-156, areal/infra/scheduler/ray.py:1-200, areal/utils/fs.py:1-111


URL: https://deepwiki.com/inclusionAI/AReaL/9-scheduler-and-job-management