Last indexed: 7 May 2026 (2e12c1)

Worker and Job Management

Purpose and Scope

This document describes the worker and job management system in AReaL, which provides the abstraction layer for creating, managing, and destroying distributed worker processes across different platforms (Ray, Slurm, local). Workers are the fundamental execution units that host engine instances (training or inference engines). The job specification defines resource requirements and replication strategies for creating groups of workers.

The system is built around the Scheduler abstract base class, which defines the lifecycle contract for all backends areal/api/scheduler_api.py43-49

Job Specification

A Job defines the requirements for creating a group of workers with a specific role. Jobs are submitted to schedulers via the create_workers() method areal/api/scheduler_api.py58-82

Job Structure

The Job dataclass specifies:

Field	Type	Description
`role`	`str`	Unique identifier for this worker group (e.g., `"rollout"`, `"actor"`)
`replicas`	`int`	Number of worker instances to create
`tasks`	`list[SchedulingSpec]`	Resource requirements per worker
`scheduling_strategy`	`SchedulingStrategy`	Placement strategy (separation, colocation)

Sources: areal/api/scheduler_api.py36-41

Scheduling Specifications

Each worker's resource requirements are defined by a SchedulingSpec. This includes hardware constraints (CPU, GPU, memory) and the command to execute (typically the RPC server) areal/api/scheduler_api.py7-10 For Ray-based execution, it also specifies a ray_placement_strategy (e.g., "deferred", "separate", "shared") to control how bundles are packed onto nodes areal/infra/scheduler/ray.py145-166

Sources: areal/api/scheduler_api.py39 areal/infra/scheduler/ray.py145-166

Scheduling Strategies

The SchedulingStrategy determines worker placement and resource sharing:

Strategy Type	Description	Use Case
`separation`	Each worker gets independent resources.	Default for rollout/training workers.
`colocation`	Workers share resources with a target role.	Reference model sharing GPU with actor.

Sources: areal/api/scheduler_api.py40-41 areal/infra/scheduler/local.py177-178

Worker Identification

Workers are identified by a hierarchical naming scheme that enables systematic organization and lookup across the distributed cluster.

Worker ID Format

Worker IDs follow the pattern: "{role}/{rank}"

Role: The job's role name (e.g., "rollout", "actor", "ref")
Rank: Zero-indexed integer within the role (e.g., 0, 1, 2)

Sources: areal/api/scheduler_api.py14-33 areal/api/scheduler_api.py197-202

Engine Naming Convention

Each worker can host multiple engine instances (for colocation scenarios). Engines are typically named using the same role/rank convention, allowing the RPC system to route method calls to the correct engine instance on a worker areal/api/scheduler_api.py201-204

Worker and Engine Colocation Mapping

Sources: areal/api/scheduler_api.py182-209 areal/infra/rpc/rpc_server.py50-56

Worker Data Structure

The Worker dataclass stores network and port information required for RPC communication:

Sources: areal/api/scheduler_api.py14-33

Worker Lifecycle Management

Workers undergo several lifecycle stages from creation to termination, managed by the Scheduler implementation.

Worker Lifecycle State Machine

Worker Lifecycle State Machine

Sources: areal/api/scheduler_api.py58-115 areal/infra/scheduler/local.py92-98

Implementation Details: Local vs Slurm vs Ray

LocalScheduler: Spawns processes using subprocess.Popen areal/infra/scheduler/local.py65 It performs round-robin GPU assignment using _allocate_gpus areal/infra/scheduler/local.py203-214 and manages ports via find_free_ports areal/utils/network.py51
SlurmScheduler: Generates sbatch scripts and uses srun for execution areal/infra/scheduler/slurm.py137-138 It tracks jobs via slurm_job_id and monitors status using query_jobs areal/infra/utils/slurm.py49
RayScheduler: Creates workers as Ray Actors using RayRPCServer areal/infra/scheduler/ray.py24 It uses Ray Placement Groups to ensure resource isolation and affinity areal/infra/scheduler/ray.py167-185

Sources: areal/infra/scheduler/local.py92-98 areal/infra/scheduler/slurm.py72-89 areal/infra/scheduler/ray.py58-77

Worker Forking

Worker forking is a mechanism where a new worker process is spawned from an existing "parent" worker's node. This is primarily used for colocation where secondary services (like a proxy server) need to share the same hardware environment as the primary engine areal/api/scheduler_api.py145-150

Guard and Forking Architecture

The Guard process is a Flask-based management layer that handles port allocation and child process forking areal/infra/rpc/rpc_server.py19-27

Worker Forking Architecture

Sources: areal/infra/rpc/rpc_server.py50-56 areal/api/scheduler_api.py145-180

Fork Implementation

Forking is triggered via the fork_workers() method. In the LocalScheduler, it identifies the target worker and sends a request to its Guard to spawn the new process areal/infra/scheduler/local.py177-178

Sources: areal/infra/scheduler/local.py177-178 areal/api/scheduler_api.py145-180

Resource Management

Port Allocation

Workers are allocated ports dynamically to avoid conflicts:

worker_ports: Used for internal communication (e.g., Ray master IP/port) areal/api/scheduler_api.py31
engine_ports: Used for engine-level communication areal/api/scheduler_api.py32

Shared Storage Validation

For distributed training, workers must access shared filesystems. The schedulers validate paths like fileroot and nfs_record_root using validate_shared_path to ensure they are consistent across the cluster areal/infra/scheduler/local.py138-145

Sources: areal/api/scheduler_api.py31-32 areal/infra/scheduler/local.py138-145

Summary of Key Entities

Class/Entity	File	Role
`Worker`	areal/api/scheduler_api.py14	Data structure representing a remote process.
`Job`	areal/api/scheduler_api.py36	Definition of a group of workers to be created.
`Scheduler`	areal/api/scheduler_api.py43	Abstract base class for all scheduler implementations.
`LocalScheduler`	areal/infra/scheduler/local.py92	Subprocess-based scheduler for single-node execution.
`RayScheduler`	areal/infra/scheduler/ray.py58	Ray-based scheduler for multi-node clusters.
`SlurmScheduler`	areal/infra/scheduler/slurm.py72	Slurm-based scheduler for HPC clusters.

Sources: areal/api/scheduler_api.py areal/infra/scheduler/local.py areal/infra/scheduler/ray.py areal/infra/scheduler/slurm.py

Refresh this wiki

URL: https://deepwiki.com/inclusionAI/AReaL/9.2-worker-and-job-management