URL: https://deepwiki.com/inclusionAI/AReaL/9.7-skypilot-integration

Last indexed: 7 May 2026 (2e12c1)

SkyPilot Integration

This page describes how to deploy and run AReaL experiments using SkyPilot, a cloud orchestration tool that enables transparent deployment across 17+ cloud providers and Kubernetes clusters. For information about other deployment methods, see Local Launcher, Ray Launcher, and SLURM Launcher.

Purpose and Scope

SkyPilot integration provides infrastructure-agnostic deployment for AReaL workloads by abstracting cloud-specific setup details. It acts as a provisioning layer that launches nodes, configures the environment, and executes setup scripts.

Key features include:

Sources: examples/skypilot/README.md:1-19, examples/skypilot/README.md:198-201, examples/skypilot/single_node.sky.yaml:1-11

Architecture Overview

SkyPilot translates the infrastructure requirements declared in a task YAML (accelerators, node count, image, storage) into provisioned nodes on which the AReaL entry point and its scheduler run.

Figure: SkyPilot Deployment and Execution Flow


Deployment Lifecycle:

  1. Provisioning: sky launch provisions VMs or Pods via the selected cloud provider based on the resources block (examples/skypilot/single_node.sky.yaml:3-11).
  2. Node Setup: SkyPilot injects environment variables such as SKYPILOT_NODE_RANK and SKYPILOT_NODE_IPS to facilitate cluster discovery (examples/skypilot/ray_cluster.sky.yaml:18-21).
  3. Execution: The run command launches the AReaL entry point (e.g., gsm8k_rl.py), which initializes the configured Scheduler (examples/skypilot/single_node.sky.yaml:23-25).
  4. Storage: Cloud buckets are mounted to a local path (e.g., /storage) for checkpoints and weight synchronization (examples/skypilot/single_node.sky.yaml:15-18).

Sources: examples/skypilot/README.md:9-67, examples/skypilot/single_node.sky.yaml:1-33, examples/skypilot/ray_cluster.sky.yaml:17-47

SkyPilot YAML Configuration Structure

SkyPilot configuration files define the hardware, software environment, and execution logic for AReaL.

Figure: SkyPilot Configuration Schema for AReaL


Key Configuration Sections:

| Field | Purpose | Example |
| --- | --- | --- |
| name | Cluster identifier for management | areal-test-skypilot (examples/skypilot/single_node.sky.yaml:1) |
| resources.accelerators | GPU type and count | A100:2 (examples/skypilot/single_node.sky.yaml:4) |
| resources.image_id | Docker runtime image | docker:ghcr.io/inclusionai/areal-runtime:v1.0.4-sglang (examples/skypilot/single_node.sky.yaml:11) |
| num_nodes | Number of nodes in the cluster | 2 (examples/skypilot/ray_cluster.sky.yaml:8) |
| file_mounts | Cloud storage mapping | Maps S3/GCS to /storage (examples/skypilot/single_node.sky.yaml:15-18) |
| run | Shell commands to execute | Starts Ray and AReaL (examples/skypilot/ray_cluster.sky.yaml:17-47) |

Sources: examples/skypilot/single_node.sky.yaml:1-33, examples/skypilot/ray_cluster.sky.yaml:1-16, examples/skypilot/README.md:17-38

Single-Node Deployment

Single-node deployment uses the LocalScheduler to manage the training and inference engines on a single machine.

Configuration Example

The following configuration launches a GSM8K GRPO experiment on 2 A100 GPUs.
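
A minimal sketch of such a task file, assembled from the fields listed in the configuration table. The name, accelerator spec, and image come from that table; the workdir, bucket name, config path, and command-line overrides are assumptions for illustration and are not the contents of examples/skypilot/single_node.sky.yaml.

```yaml
# Sketch only: lines marked "assumed" are illustrative, not copied from the repo.
name: areal-test-skypilot

resources:
  accelerators: A100:2
  image_id: docker:ghcr.io/inclusionai/areal-runtime:v1.0.4-sglang

# Assumed: sync the local AReaL checkout to the node's working directory.
workdir: .

file_mounts:
  /storage:
    name: my-areal-bucket   # assumed bucket name
    store: s3               # assumed; gcs is the other store named in this page
    mode: MOUNT

run: |
  # Assumed script location, config path, and overrides; adjust to your checkout.
  python3 examples/math/gsm8k_rl.py \
    --config examples/math/gsm8k_grpo.yaml \
    scheduler.type=local \
    cluster.fileroot=/storage/experiments
```

Under these assumptions the file would be launched with sky launch -c areal-test-skypilot followed by the path to the YAML file; checkpoints written under /storage then persist in the bucket after the cluster is torn down.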


Key Implementation Details:

Sources: examples/skypilot/README.md:9-50, examples/skypilot/single_node.sky.yaml:1-33

Multi-Node Deployment with Ray

Multi-node deployment requires establishing a Ray cluster across SkyPilot nodes before launching the AReaL experiment.

Ray Cluster Setup Pattern

Figure: Multi-Node Initialization Sequence


Full Multi-Node Configuration

The run block handles the conditional logic for head vs. worker nodes.
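
The head/worker branching can be sketched as follows. This follows the common SkyPilot pattern of treating the first entry of SKYPILOT_NODE_IPS as the Ray head; the cluster name, GPU count, Ray port, sleep durations, and the AReaL launch command are assumptions, not the contents of examples/skypilot/ray_cluster.sky.yaml.

```yaml
# Sketch only: lines marked "assumed" are illustrative, not copied from the repo.
name: areal-ray-cluster      # assumed name
num_nodes: 2

resources:
  accelerators: A100:8       # assumed GPU count per node
  image_id: docker:ghcr.io/inclusionai/areal-runtime:v1.0.4-sglang

run: |
  # SKYPILOT_NODE_IPS holds one IP per line; the first entry is rank 0.
  HEAD_IP=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  RAY_PORT=6379              # assumed port

  if [ "$SKYPILOT_NODE_RANK" = "0" ]; then
    # Head node: start Ray, give workers time to join, then launch AReaL.
    ray start --head --port="$RAY_PORT"
    sleep 30                 # assumed grace period for workers to connect
    # Assumed script location, config path, and override.
    python3 examples/math/gsm8k_rl.py \
      --config examples/math/gsm8k_grpo.yaml \
      scheduler.type=ray
  else
    # Worker nodes: wait for the head to come up, then join the cluster.
    sleep 10                 # assumed delay
    ray start --address="$HEAD_IP:$RAY_PORT"
  fi
```

A fixed sleep is the simplest way to order head and worker startup; a more robust variant would poll ray status on the head until the expected number of nodes has joined.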


Implementation Breakdown:

Sources: examples/skypilot/ray_cluster.sky.yaml:1-47, examples/skypilot/README.md:120-152

Docker Image and Build System

AReaL provides pre-built Docker images optimized for SkyPilot deployment. These are managed via a centralized CI/CD pipeline.

Image Variants

The build system produces variants tailored to different inference backends.

| Variant | Tag Suffix | Description |
| --- | --- | --- |
| sglang | -sglang | Optimized for SGLangBackend (.github/workflows/tag-release-image.yml:158-170) |
| vllm | -vllm | Optimized for VLLMBackend (.github/workflows/tag-release-image.yml:171-182) |
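
Selecting a variant comes down to the tag suffix in resources.image_id. The fragment below is a sketch that assumes a -vllm image is published for the same version as the v1.0.4-sglang tag used in the configuration table; check the registry for the tags that actually exist.

```yaml
resources:
  accelerators: A100:2
  # Assumed tag: same version as the sglang example, with the -vllm suffix.
  image_id: docker:ghcr.io/inclusionai/areal-runtime:v1.0.4-vllm
```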

CI/CD Pipeline

The Build Release Docker Image workflow automates the following:

  1. Provisioning: Starts a GCP instance (areal-docker-builder) to perform the build (.github/workflows/tag-release-image.yml:26-72).
  2. Version Extraction: Reads the version from pyproject.toml, the single source of truth (.github/workflows/tag-release-image.yml:136-137).
  3. Build & Push: Builds images for both the sglang and vllm variants and pushes them to ghcr.io/inclusionai/areal-runtime (.github/workflows/tag-release-image.yml:158-182).

Versioning

Runtime versioning is managed by the VersionInfo class in areal/version.py, which tracks the package version, git commit, and repository "dirty" status (areal/version.py:18-49).

Sources: .github/workflows/tag-release-image.yml:1-182, areal/version.py:1-91

Kubernetes and RDMA Support

For high-performance training on Kubernetes clusters with InfiniBand (RDMA), pods need a security context that grants the IPC_LOCK capability. This capability lets processes lock (pin) memory, which the zero-copy memory transfers used by NCCL/XCCL require.
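
One way to grant this is through SkyPilot's Kubernetes pod_config override, sketched below. It is shown as a global ~/.sky/config.yaml entry, which is an assumption about placement and may differ from the form used in examples/skypilot/README.md; the cluster must also permit pods to request this capability.

```yaml
# ~/.sky/config.yaml (assumed location for this override)
kubernetes:
  pod_config:
    spec:
      containers:
        - securityContext:
            capabilities:
              add:
                - IPC_LOCK   # allow memory pinning for RDMA transfers
```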


Sources: examples/skypilot/README.md:154-168

Best Practices

Sources: examples/skypilot/README.md:178-195, examples/skypilot/single_node.sky.yaml:22-32