VOOZH about

URL: https://deepwiki.com/inclusionAI/AReaL/9.5-ray-launcher

⇱ Ray Launcher | inclusionAI/AReaL | DeepWiki


Loading...
Last indexed: 7 May 2026 (2e12c1)
Menu

Ray Launcher

The Ray Launcher provides distributed worker management for AReaL training on Ray clusters. It implements the Scheduler interface using Ray's actor system and placement groups to orchestrate training and inference workers across multi-node GPU/NPU clusters.

For local single-node execution, see Local Launcher. For HPC cluster integration, see SLURM Launcher. For general scheduler concepts, see Scheduler API.

Sources: areal/infra/scheduler/ray.py58-77 areal/api/scheduler_api.py43-49


Architecture Overview

The RayScheduler manages worker processes as Ray actors, each running a RayRPCServer that hosts engine instances. Workers are organized into placement groups to control GPU allocation and enable colocation.

Ray Scheduler Component Mapping

The following diagram maps high-level system components to their corresponding code entities within the Ray infrastructure.


Sources: areal/infra/scheduler/ray.py48-56 areal/infra/scheduler/ray.py59-80 areal/api/scheduler_api.py14-33


Placement Strategies

Ray placement strategies determine how workers are allocated to GPUs and nodes. AReaL provides three strategies via SchedulingSpec.ray_placement_strategy.

Strategy Types

StrategyUse CaseImplementation ClassGPU Allocation
sharedTraining workersSharedRayPlacementStrategyMultiple workers share 1 PG; each gets 1 bundle
separateRollout workersSeparatedRayPlacementStrategy1 worker gets 1 exclusive PG
deferredRollout with local serversDeferredDeviceRayPlacementStrategyWorker requests 0 GPU; server subprocess claims GPU

Sources: areal/infra/utils/ray_placement_group.py118-235 areal/infra/scheduler/ray.py145-166


Worker Creation and Lifecycle

Creating Workers

The create_workers method spawns Ray actors based on a Job specification. It uses _prepare_worker_specs to validate the task list against the replica count.


Sources: areal/infra/scheduler/ray.py167-205 areal/infra/scheduler/ray.py85-102 areal/infra/rpc/ray_rpc_server.py23-35

Worker Lifecycle Methods

MethodPurposeKey Operations
create_workers(job)Spawn new workersCreate PG, spawn actors, ping, configure
get_workers(role, timeout)Retrieve worker listPing workers, return Worker objects
delete_workers(role)Clean up workersKill actors, remove PGs
fork_workers(role, target_role)Create colocated workersSpawn actors on existing PG

Sources: areal/infra/scheduler/ray.py167-205 areal/infra/scheduler/ray.py207-227 areal/infra/scheduler/ray.py229-277 areal/infra/scheduler/ray.py279-326


Colocation and Forking

AReaL supports two colocation modes for sharing GPUs between different worker roles:

Reuse Colocation (fork=False)

Reuses existing workers without spawning new actors. Multiple roles share the same worker processes. The scheduler resolves this by mapping the role to the target role in _colocated_roles.

Sources: areal/infra/scheduler/ray.py207-227

Fork Colocation (fork=True)

Spawns new actors on the same placement groups as the target role, enabling true multi-tenancy. This is often used for reference models or critics sharing a GPU with an actor.


Forked workers use a minimal GPU fraction (e.g., 0.01) to satisfy Ray's resource constraints while sharing the actual device with the target worker (which uses MAIN_WORKER_GPU_FRAC_FOR_COLOCATION of 0.9).

Sources: areal/infra/scheduler/ray.py279-326 areal/infra/utils/ray_placement_group.py22


Engine RPC System

The RayScheduler provides RPC methods to create engines and invoke methods on remote workers.

Creating Engines

Engines are created on workers using dynamic import paths via the create_engine method. The RayRPCServer handles the instantiation of TrainEngine or InferenceEngine subclasses.


Sources: areal/infra/scheduler/ray.py367-386 areal/infra/rpc/ray_rpc_server.py93-127

Calling Engine Methods

The call_engine and async_call_engine methods invoke methods on remote engine instances with retry logic and exponential backoff. The RayRPCServer performs RTensor.localize to handle remote tensor arguments and supports payload broadcasting for distributed training.

MethodModeDefault TimeoutDefault RetriesUse Case
call_engine()Sync7200s3Blocking control calls
async_call_engine()Async7200s3Concurrent data processing

Sources: areal/infra/scheduler/ray.py388-441 areal/infra/scheduler/ray.py443-492 areal/infra/rpc/ray_rpc_server.py128-198


Resource Allocation and Storage

Bundle Specification

Ray bundles specify CPU, GPU/NPU, and memory resources. The _bundle_spec function creates bundle dictionaries based on detected hardware. It supports NPU via is_npu_available checks.


Sources: areal/infra/utils/ray_placement_group.py72-84 areal/infra/utils/ray_placement_group.py25-35

Shared Storage Validation

Since Ray often operates in multi-node environments, the system validates that data paths are accessible across all nodes. The validate_shared_path function checks if a path is on a known network filesystem (NFS, Lustre, Ceph, etc.) and warns if local storage is used for distributed tasks.

Sources: areal/utils/fs.py58-111

Multi-Node Splitting

For multi-node allocations, _create_bundle_specs_split distributes resources across nodes to ensure large GPU requests are satisfied by full nodes.


Sources: areal/infra/utils/ray_placement_group.py38-69


Error Handling

The Ray Launcher maps Ray-specific exceptions to AReaL's internal exception hierarchy for consistent error reporting across different launchers.

Ray ExceptionAReaL ExceptionContext
ray.exceptions.GetTimeoutErrorWorkerTimeoutErrorWorker failed to respond to ping
ray.exceptions.RayActorErrorWorkerFailedErrorWorker process crashed
RPC FailureEngineCallErrorMethod execution failed on remote engine

Sources: areal/infra/scheduler/ray.py25-31 areal/infra/scheduler/ray.py104-129