URL: https://deepwiki.com/inclusionAI/AReaL/9.8-shared-storage-validation

Last indexed: 7 May 2026 (2e12c1)

Shared Storage Validation

Purpose and Scope

Shared Storage Validation ensures that all workers in a distributed AReaL deployment can access a common filesystem for checkpoints, logs, and model weights. This validation is critical for distributed training workflows where multiple nodes must coordinate through shared files, such as when using the nfs name resolution strategy or disk-based weight synchronization.

In AReaL's asynchronous architecture, shared storage serves several purposes:

Sources: areal/infra/scheduler/local.py49-165 areal/infra/scheduler/slurm.py122-187


Shared Storage in AReaL Architecture

Shared storage sits between the scheduler system and the execution engines and lets distributed workers coordinate. The validate_shared_path utility (areal/infra/scheduler/local.py49) is the primary mechanism for checking, during scheduler initialization, that these paths are valid and reside on an appropriate network filesystem.

Distributed Storage Interaction


Diagram: Shared Storage Validation Architecture

Sources: areal/infra/scheduler/local.py138-144 areal/infra/scheduler/slurm.py120-126 areal/api/scheduler_api.py13-33


Storage Requirements

AReaL's shared storage must satisfy several requirements to support distributed training:

| Requirement | Description | Implementation Context |
|---|---|---|
| Cross-node accessibility | All workers must access identical paths | LocalScheduler / SlurmScheduler init areal/infra/scheduler/local.py138 |
| Network FS Type | Must be a distributed filesystem (NFS, Lustre, etc.) | validate_shared_path areal/infra/scheduler/local.py49 |
| Path Existence | The directory must exist or be creatable | log_dir.mkdir areal/infra/scheduler/local.py171 |
| Experiment Context | Paths are namespaced by experiment and trial | names.trial_root areal/infra/scheduler/local.py150 |
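
The "Path Existence" and "Experiment Context" requirements can be illustrated with a minimal sketch. The helper name `ensure_trial_dir` and the `<fileroot>/<experiment>/<trial>` layout are assumptions made for illustration; they are not the actual behavior of names.trial_root or of the schedulers.

```python
from pathlib import Path


def ensure_trial_dir(fileroot: str, experiment_name: str, trial_name: str) -> Path:
    """Create and return a per-trial directory under `fileroot`.

    Hypothetical helper: the <fileroot>/<experiment>/<trial> layout is an
    assumption for illustration, not AReaL's actual directory scheme.
    """
    for part in (experiment_name, trial_name):
        # Reject components that would escape or collapse the namespace.
        if not part or "/" in part or part in (".", ".."):
            raise ValueError(f"invalid path component: {part!r}")
    trial_dir = Path(fileroot) / experiment_name / trial_name
    # parents=True creates missing ancestors; exist_ok=True makes reruns idempotent.
    trial_dir.mkdir(parents=True, exist_ok=True)
    return trial_dir
```

Namespacing by experiment and trial keeps concurrent runs that share one fileroot from overwriting each other's logs and checkpoints.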

Scheduler Integration

Both LocalScheduler and SlurmScheduler validate fileroot and nfs_record_root during instantiation.

Sources: areal/infra/scheduler/local.py138-171 areal/infra/scheduler/slurm.py120-134 areal/infra/scheduler/ray.py69-72


Validation Implementation

The validation logic ensures that the provided paths are not only present but also reside on a filesystem capable of multi-node consistency.
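
One way such a check could work on Linux is to look up the filesystem type of the mount that contains the path. The following is a minimal sketch under stated assumptions: the function names, the set of accepted filesystem types, and the use of /proc/mounts are all illustrative and are not taken from validate_shared_path itself.

```python
import os
import warnings
from typing import Optional

# Assumed set of network filesystem types; the types AReaL accepts may differ.
NETWORK_FS_TYPES = {"nfs", "nfs4", "lustre", "gpfs", "ceph", "cifs", "beegfs"}


def find_fs_type(path: str, mounts_text: str) -> str:
    """Return the filesystem type of the longest mount point containing `path`.

    `mounts_text` uses the /proc/mounts format: device, mount point, type, ...
    """
    path = os.path.abspath(path)
    best_mount, best_type = "", "unknown"
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # ">=" lets a later mount on the same point shadow an earlier one.
        if (path + "/").startswith(prefix) and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


def check_shared_path(path: str, mounts_text: Optional[str] = None) -> bool:
    """Warn (rather than fail) when `path` is not on a known network filesystem."""
    if mounts_text is None:
        with open("/proc/mounts") as f:  # Linux only
            mounts_text = f.read()
    fs_type = find_fs_type(path, mounts_text)
    if fs_type in NETWORK_FS_TYPES:
        return True
    warnings.warn(f"{path} is on '{fs_type}', which is not a known network filesystem")
    return False
```

Passing the mount table as a string keeps the lookup testable without a real NFS mount. A check of this kind only inspects the local node; it cannot prove that other nodes mount the same export at the same path.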

Validation Flow


Diagram: Path Validation Sequence in Schedulers

Sources: areal/infra/scheduler/local.py138-151 areal/infra/scheduler/slurm.py120-133


Infrastructure Validation

Beyond filesystem paths, the system validates the network and hardware environment required for distributed RPC communication.

Network and Port Validation

The scheduler ensures that network resources are available before launching worker processes.
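
As an illustration of this kind of pre-launch check, here is a minimal sketch that reserves free TCP ports and tests whether a given port can be bound. The helper names `find_free_ports` and `is_port_free` are hypothetical and do not come from the AReaL schedulers.

```python
import socket
from typing import List


def find_free_ports(count: int, host: str = "127.0.0.1") -> List[int]:
    """Ask the OS for `count` distinct, currently unused TCP ports on `host`."""
    sockets, ports = [], []
    try:
        for _ in range(count):
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            sockets.append(s)
            s.bind((host, 0))  # port 0 lets the OS pick an unused port
            ports.append(s.getsockname()[1])
    finally:
        # All sockets stay open until here so the OS cannot return duplicates.
        for s in sockets:
            s.close()
    return ports


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if `port` can be bound on `host` right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return False
    return True
```

Both helpers are inherently racy: another process can claim a port between the check and the worker's own bind, so a launcher built this way would still need to handle bind failures at startup.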

Hardware and Resource Validation

  • GPU Detection: _get_device_count_safely scans /dev for nvidia or davinci devices to count GPUs without initializing a CUDA context, which avoids locking device resources prematurely areal/infra/scheduler/local.py73-90
  • GPU Allocation: _allocate_gpus ensures the requested number of devices does not exceed available hardware on the local node areal/infra/scheduler/local.py203-210
  • Placement Strategies: RayScheduler validates resource constraints using strategies like DeferredDeviceRayPlacementStrategy, SeparatedRayPlacementStrategy, or SharedRayPlacementStrategy areal/infra/scheduler/ray.py145-166
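
The first two checks above can be sketched as follows. This is an illustration under stated assumptions, not the code of _get_device_count_safely or _allocate_gpus: the device-node patterns, the function names, and the choice of ValueError are all hypothetical.

```python
import os
import re

# Assumed device-node names: /dev/nvidia0, /dev/davinci0, and so on.
# Control nodes such as /dev/nvidiactl or /dev/davinci_manager do not match.
_DEVICE_NODE = re.compile(r"^(nvidia|davinci)\d+$")


def count_devices(dev_dir: str = "/dev") -> int:
    """Count accelerator device nodes by name, without touching CUDA or torch."""
    try:
        entries = os.listdir(dev_dir)
    except OSError:
        return 0  # directory missing or unreadable: report no devices
    return sum(1 for name in entries if _DEVICE_NODE.match(name))


def validate_device_request(requested: int, available: int) -> None:
    """Raise if a worker asks for more devices than the local node has."""
    if requested < 0:
        raise ValueError(f"requested device count must be >= 0, got {requested}")
    if requested > available:
        raise ValueError(
            f"requested {requested} devices but only {available} are available"
        )
```

Listing a directory has no side effects on the devices, whereas querying the count through a CUDA runtime would create a context in the scheduler process.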

Sources: areal/infra/scheduler/local.py51-210 areal/infra/scheduler/ray.py104-166


Common Validation Failures and Warnings

Non-Network Filesystem Warning

If a path is detected on a local disk (e.g., /tmp on a local SSD), validate_shared_path (areal/infra/scheduler/local.py49) warns the user. This matters because:

  1. Weight Sync: Training engines write weights to a directory that inference engines on other nodes must read.
  2. Name Resolution: The nfs record root MUST be shared for name_resolve to function across different IPs areal/infra/scheduler/local.py140-144
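
To make the second point concrete, here is a minimal sketch of a file-backed name registry. It is not AReaL's name_resolve API; `publish`, `lookup`, and the one-file-per-name layout are assumptions chosen to show why the record root has to be visible from every node.

```python
import os
import tempfile
from pathlib import Path


def publish(record_root: str, name: str, value: str) -> Path:
    """Write `value` (e.g. "10.0.0.5:4321") under `name` in the record root."""
    target = Path(record_root) / name
    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory, then rename it into place,
    # so a reader never observes a partially written record.
    fd, tmp_path = tempfile.mkstemp(dir=target.parent)
    with os.fdopen(fd, "w") as f:
        f.write(value)
    os.replace(tmp_path, target)
    return target


def lookup(record_root: str, name: str) -> str:
    """Read the value published under `name`; raises FileNotFoundError if absent."""
    return (Path(record_root) / name).read_text()
```

If the record root were on a node-local disk, a worker on another machine calling `lookup` would see an empty directory, which is the failure mode the warning is meant to surface.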

Scheduler Exception Handling

The validation system uses specific exceptions to signal storage and environment failures.

Sources: areal/infra/scheduler/local.py35-208 areal/infra/scheduler/ray.py116-128