Last indexed: 7 May 2026 (2e12c1)

Ray Launcher

The Ray Launcher provides distributed worker management for AReaL training on Ray clusters. It implements the Scheduler interface using Ray's actor system and placement groups to orchestrate training and inference workers across multi-node GPU/NPU clusters.

For local single-node execution, see Local Launcher. For HPC cluster integration, see SLURM Launcher. For general scheduler concepts, see Scheduler API.

Sources: areal/infra/scheduler/ray.py58-77 areal/api/scheduler_api.py43-49

Architecture Overview

The RayScheduler manages worker processes as Ray actors, each running a RayRPCServer that hosts engine instances. Workers are organized into placement groups to control GPU allocation and enable colocation.

Ray Scheduler Component Mapping

The following diagram maps high-level system components to their corresponding code entities within the Ray infrastructure.

Sources: areal/infra/scheduler/ray.py48-56 areal/infra/scheduler/ray.py59-80 areal/api/scheduler_api.py14-33

Placement Strategies

Ray placement strategies determine how workers are allocated to GPUs and nodes. AReaL provides three strategies via SchedulingSpec.ray_placement_strategy.

Strategy Types

Strategy	Use Case	Implementation Class	GPU Allocation
`shared`	Training workers	`SharedRayPlacementStrategy`	Multiple workers share 1 PG; each gets 1 bundle
`separate`	Rollout workers	`SeparatedRayPlacementStrategy`	1 worker gets 1 exclusive PG
`deferred`	Rollout with local servers	`DeferredDeviceRayPlacementStrategy`	Worker requests 0 GPU; server subprocess claims GPU

Sources: areal/infra/utils/ray_placement_group.py118-235 areal/infra/scheduler/ray.py145-166

Worker Creation and Lifecycle

Creating Workers

The create_workers method spawns Ray actors based on a Job specification. It uses _prepare_worker_specs to validate the task list against the replica count.

Sources: areal/infra/scheduler/ray.py167-205 areal/infra/scheduler/ray.py85-102 areal/infra/rpc/ray_rpc_server.py23-35

Worker Lifecycle Methods

Method	Purpose	Key Operations
`create_workers(job)`	Spawn new workers	Create PG, spawn actors, ping, configure
`get_workers(role, timeout)`	Retrieve worker list	Ping workers, return `Worker` objects
`delete_workers(role)`	Clean up workers	Kill actors, remove PGs
`fork_workers(role, target_role)`	Create colocated workers	Spawn actors on existing PG

Sources: areal/infra/scheduler/ray.py167-205 areal/infra/scheduler/ray.py207-227 areal/infra/scheduler/ray.py229-277 areal/infra/scheduler/ray.py279-326

Colocation and Forking

AReaL supports two colocation modes for sharing GPUs between different worker roles:

Reuse Colocation (fork=False)

Reuses existing workers without spawning new actors. Multiple roles share the same worker processes. The scheduler resolves this by mapping the role to the target role in _colocated_roles.

Sources: areal/infra/scheduler/ray.py207-227

Fork Colocation (fork=True)

Spawns new actors on the same placement groups as the target role, enabling true multi-tenancy. This is often used for reference models or critics sharing a GPU with an actor.

Forked workers use a minimal GPU fraction (e.g., 0.01) to satisfy Ray's resource constraints while sharing the actual device with the target worker (which uses MAIN_WORKER_GPU_FRAC_FOR_COLOCATION of 0.9).

Sources: areal/infra/scheduler/ray.py279-326 areal/infra/utils/ray_placement_group.py22

Engine RPC System

The RayScheduler provides RPC methods to create engines and invoke methods on remote workers.

Creating Engines

Engines are created on workers using dynamic import paths via the create_engine method. The RayRPCServer handles the instantiation of TrainEngine or InferenceEngine subclasses.

Sources: areal/infra/scheduler/ray.py367-386 areal/infra/rpc/ray_rpc_server.py93-127

Calling Engine Methods

The call_engine and async_call_engine methods invoke methods on remote engine instances with retry logic and exponential backoff. The RayRPCServer performs RTensor.localize to handle remote tensor arguments and supports payload broadcasting for distributed training.

Method	Mode	Default Timeout	Default Retries	Use Case
`call_engine()`	Sync	7200s	3	Blocking control calls
`async_call_engine()`	Async	7200s	3	Concurrent data processing

Sources: areal/infra/scheduler/ray.py388-441 areal/infra/scheduler/ray.py443-492 areal/infra/rpc/ray_rpc_server.py128-198

Resource Allocation and Storage

Bundle Specification

Ray bundles specify CPU, GPU/NPU, and memory resources. The _bundle_spec function creates bundle dictionaries based on detected hardware. It supports NPU via is_npu_available checks.

Sources: areal/infra/utils/ray_placement_group.py72-84 areal/infra/utils/ray_placement_group.py25-35

Shared Storage Validation

Since Ray often operates in multi-node environments, the system validates that data paths are accessible across all nodes. The validate_shared_path function checks if a path is on a known network filesystem (NFS, Lustre, Ceph, etc.) and warns if local storage is used for distributed tasks.

Sources: areal/utils/fs.py58-111

Multi-Node Splitting

For multi-node allocations, _create_bundle_specs_split distributes resources across nodes to ensure large GPU requests are satisfied by full nodes.

Sources: areal/infra/utils/ray_placement_group.py38-69

Error Handling

The Ray Launcher maps Ray-specific exceptions to AReaL's internal exception hierarchy for consistent error reporting across different launchers.

Ray Exception	AReaL Exception	Context
`ray.exceptions.GetTimeoutError`	`WorkerTimeoutError`	Worker failed to respond to ping
`ray.exceptions.RayActorError`	`WorkerFailedError`	Worker process crashed
RPC Failure	`EngineCallError`	Method execution failed on remote engine

Sources: areal/infra/scheduler/ray.py25-31 areal/infra/scheduler/ray.py104-129

Refresh this wiki

URL: https://deepwiki.com/inclusionAI/AReaL/9.5-ray-launcher

⇱ Ray Launcher | inclusionAI/AReaL | DeepWiki