VOOZH about

URL: https://deepwiki.com/inclusionAI/AReaL/9.2-worker-and-job-management

⇱ Worker and Job Management | inclusionAI/AReaL | DeepWiki


Loading...
Last indexed: 7 May 2026 (2e12c1)
Menu

Worker and Job Management

Purpose and Scope

This document describes the worker and job management system in AReaL, which provides the abstraction layer for creating, managing, and destroying distributed worker processes across different platforms (Ray, Slurm, local). Workers are the fundamental execution units that host engine instances (training or inference engines). The job specification defines resource requirements and replication strategies for creating groups of workers.

The system is built around the Scheduler abstract base class, which defines the lifecycle contract for all backends areal/api/scheduler_api.py43-49


Job Specification

A Job defines the requirements for creating a group of workers with a specific role. Jobs are submitted to schedulers via the create_workers() method areal/api/scheduler_api.py58-82

Job Structure

The Job dataclass specifies:

FieldTypeDescription
rolestrUnique identifier for this worker group (e.g., "rollout", "actor")
replicasintNumber of worker instances to create
taskslist[SchedulingSpec]Resource requirements per worker
scheduling_strategySchedulingStrategyPlacement strategy (separation, colocation)

Sources: areal/api/scheduler_api.py36-41

Scheduling Specifications

Each worker's resource requirements are defined by a SchedulingSpec. This includes hardware constraints (CPU, GPU, memory) and the command to execute (typically the RPC server) areal/api/scheduler_api.py7-10 For Ray-based execution, it also specifies a ray_placement_strategy (e.g., "deferred", "separate", "shared") to control how bundles are packed onto nodes areal/infra/scheduler/ray.py145-166

Sources: areal/api/scheduler_api.py39 areal/infra/scheduler/ray.py145-166

Scheduling Strategies

The SchedulingStrategy determines worker placement and resource sharing:

Strategy TypeDescriptionUse Case
separationEach worker gets independent resources.Default for rollout/training workers.
colocationWorkers share resources with a target role.Reference model sharing GPU with actor.

Sources: areal/api/scheduler_api.py40-41 areal/infra/scheduler/local.py177-178


Worker Identification

Workers are identified by a hierarchical naming scheme that enables systematic organization and lookup across the distributed cluster.

Worker ID Format

Worker IDs follow the pattern: "{role}/{rank}"

  • Role: The job's role name (e.g., "rollout", "actor", "ref")
  • Rank: Zero-indexed integer within the role (e.g., 0, 1, 2)

Sources: areal/api/scheduler_api.py14-33 areal/api/scheduler_api.py197-202

Engine Naming Convention

Each worker can host multiple engine instances (for colocation scenarios). Engines are typically named using the same role/rank convention, allowing the RPC system to route method calls to the correct engine instance on a worker areal/api/scheduler_api.py201-204


Worker and Engine Colocation Mapping

Sources: areal/api/scheduler_api.py182-209 areal/infra/rpc/rpc_server.py50-56

Worker Data Structure

The Worker dataclass stores network and port information required for RPC communication:


Sources: areal/api/scheduler_api.py14-33


Worker Lifecycle Management

Workers undergo several lifecycle stages from creation to termination, managed by the Scheduler implementation.

Worker Lifecycle State Machine


Worker Lifecycle State Machine

Sources: areal/api/scheduler_api.py58-115 areal/infra/scheduler/local.py92-98

Implementation Details: Local vs Slurm vs Ray

Sources: areal/infra/scheduler/local.py92-98 areal/infra/scheduler/slurm.py72-89 areal/infra/scheduler/ray.py58-77


Worker Forking

Worker forking is a mechanism where a new worker process is spawned from an existing "parent" worker's node. This is primarily used for colocation where secondary services (like a proxy server) need to share the same hardware environment as the primary engine areal/api/scheduler_api.py145-150

Guard and Forking Architecture

The Guard process is a Flask-based management layer that handles port allocation and child process forking areal/infra/rpc/rpc_server.py19-27


Worker Forking Architecture

Sources: areal/infra/rpc/rpc_server.py50-56 areal/api/scheduler_api.py145-180

Fork Implementation

Forking is triggered via the fork_workers() method. In the LocalScheduler, it identifies the target worker and sends a request to its Guard to spawn the new process areal/infra/scheduler/local.py177-178

Sources: areal/infra/scheduler/local.py177-178 areal/api/scheduler_api.py145-180


Resource Management

Port Allocation

Workers are allocated ports dynamically to avoid conflicts:

  1. worker_ports: Used for internal communication (e.g., Ray master IP/port) areal/api/scheduler_api.py31
  2. engine_ports: Used for engine-level communication areal/api/scheduler_api.py32

Shared Storage Validation

For distributed training, workers must access shared filesystems. The schedulers validate paths like fileroot and nfs_record_root using validate_shared_path to ensure they are consistent across the cluster areal/infra/scheduler/local.py138-145

Sources: areal/api/scheduler_api.py31-32 areal/infra/scheduler/local.py138-145


Summary of Key Entities

Class/EntityFileRole
Workerareal/api/scheduler_api.py14Data structure representing a remote process.
Jobareal/api/scheduler_api.py36Definition of a group of workers to be created.
Schedulerareal/api/scheduler_api.py43Abstract base class for all scheduler implementations.
LocalSchedulerareal/infra/scheduler/local.py92Subprocess-based scheduler for single-node execution.
RaySchedulerareal/infra/scheduler/ray.py58Ray-based scheduler for multi-node clusters.
SlurmSchedulerareal/infra/scheduler/slurm.py72Slurm-based scheduler for HPC clusters.

Sources: areal/api/scheduler_api.py areal/infra/scheduler/local.py areal/infra/scheduler/ray.py areal/infra/scheduler/slurm.py