VOOZH about

URL: https://deepwiki.com/inclusionAI/AReaL/9-scheduler-and-job-management

⇱ Scheduler and Job Management | inclusionAI/AReaL | DeepWiki


Loading...
Last indexed: 7 May 2026 (2e12c1)
Menu

Scheduler and Job Management

Purpose and Scope

This document describes AReaL's scheduler abstraction and distributed job management system. The scheduler is responsible for provisioning worker processes across execution environments (local machines, Ray clusters, Slurm HPC systems), creating and managing engine instances on those workers, and facilitating remote procedure calls (RPC) between the control plane and data plane.

The scheduler layer decouples AReaL's high-level training orchestration from the underlying execution environment, enabling the same training code to run on a laptop, a Ray cluster, or an HPC system with minimal configuration changes. For configuration of resource allocation patterns, see allocation_mode Syntax. For details on training and inference engines that run on workers, see Training Engines and Inference System.

Sources: areal/api/scheduler_api.py43-49 areal/api/scheduler_api.py51-220


Scheduler Architecture Overview

The scheduler follows a control plane / data plane separation. The control plane runs on the driver process and manages job orchestration, while the data plane consists of worker processes that host engine instances performing actual computation. A central component of the data plane is the Guard, which handles process management and port allocation.

System Components to Code Entities

The following diagram maps high-level system components to their specific implementations and identifiers in the codebase, specifically highlighting the role of the Guard and SyncRPCServer.


Sources: areal/api/scheduler_api.py43-49 areal/infra/rpc/rpc_server.py33-62 areal/infra/rpc/guard/app.py39-109 areal/infra/rpc/guard/engine_blueprint.py92-106 areal/infra/controller/train_controller.py150-153 areal/infra/controller/rollout_controller.py185-190

Worker Lifecycle and RPC Mapping

The scheduler interacts with workers through standardized API calls that map to internal RPC server routes. The SyncRPCServer utilizes an internal _engine_thread to ensure serial execution of engine operations, which is critical for NCCL stability.


Sources: areal/api/scheduler_api.py51-210 areal/infra/rpc/guard/app.py154-180 areal/infra/rpc/guard/engine_blueprint.py101-168 areal/infra/rpc/serialization.py1-30


Core Data Structures

Worker

The Worker dataclass represents a worker process in the distributed system. Each worker has a unique identifier, IP address, and allocated port numbers for communication.

FieldTypeDescription
idstrUnique identifier (e.g., "rollout/0", "actor/1")
ipstrIP address where worker is running
worker_portslist[str]Port numbers for worker-level communication
engine_portslist[str]Port numbers for engine-level communication

Sources: areal/api/scheduler_api.py14-32

Job

The Job dataclass specifies how to create a group of workers for a specific role.

FieldTypeDescription
rolestrRole name (e.g., "rollout", "actor")
replicasintNumber of worker replicas to create
taskslist[SchedulingSpec]Resource requirements (GPU, CPU, Memory) per task
scheduling_strategySchedulingStrategyPlacement strategy (e.g., colocation)

Sources: areal/api/scheduler_api.py35-40


Scheduler Implementations

The Scheduler abstract base class defines the contract for managing distributed worker processes.


Local Launcher

LocalScheduler manages worker subprocesses on a single GPU node. It handles round-robin GPU assignment via _allocate_gpus and monitors process health. It validates shared paths for fileroot and name_resolve before starting. For details, see Local Launcher.

Sources: areal/infra/scheduler/local.py92-171 areal/infra/scheduler/local.py199-215

Ray Launcher

RayScheduler uses Ray placement groups to manage resources across a cluster. It supports different placement strategies (Shared, Separated, Deferred) via _get_placement_strategy and uses RayRPCServer to host engines within Ray actors. For details, see Ray Launcher.

Sources: areal/infra/scheduler/ray.py58-71 areal/infra/scheduler/ray.py145-166 areal/infra/scheduler/ray.py167-200

SLURM Launcher

SlurmScheduler integrates with HPC clusters using sbatch and srun. It tracks job states using query_jobs and handles worker discovery via name_resolve. It supports containerized execution via Apptainer or Docker. For details, see SLURM Launcher.

Sources: areal/infra/scheduler/slurm.py72-156 areal/infra/scheduler/slurm.py135-140


Engine RPC System

AReaL uses a custom RPC system built on Flask/Werkzeug to execute methods on remote engines. This is managed by the Guard and associated blueprints.

The Guard Process

The Guard is a base process management layer. It maintains a GuardState that tracks allocated ports and forked child processes.

Engine Creation and Execution

The engine_bp blueprint extends the Guard to support engine lifecycles:

Sources: areal/infra/rpc/rpc_server.py33-62 areal/infra/rpc/guard/app.py19-25 areal/infra/rpc/guard/engine_blueprint.py92-106


Shared Storage Validation

Distributed training requires all nodes to access the same checkpoints and logs. The validate_shared_path utility ensures that the provided paths (like fileroot) are on a recognized network filesystem to prevent data inconsistency.

Network Filesystem Detection

The system checks psutil.disk_partitions() against a list of known keywords defined in NETWORK_FS_KEYWORDS:

  • Standard: nfs, lustre, gpfs, beegfs, ceph, glusterfs
  • Cloud: efs (AWS), gcsfuse (GCP), cpfs (Alibaba), vepfs (ByteDance), obsfs (Huawei)
  • Distributed: juicefs, alluxio, seaweedfs, hdfs

Sources: areal/utils/fs.py10-48 areal/utils/fs.py58-111


Summary of Child Pages

Sources: areal/api/scheduler_api.py1-251 areal/infra/scheduler/local.py1-206 areal/infra/scheduler/slurm.py1-156 areal/infra/scheduler/ray.py1-200 areal/utils/fs.py1-111