URL: https://deepwiki.com/inclusionAI/AReaL/15.1-test-infrastructure

Last indexed: 7 May 2026 (2e12c1)

Test Infrastructure

Purpose and Scope

This document describes AReaL's testing infrastructure, including pytest configuration, test organization, markers, and specialized validation tools. The infrastructure is designed to handle complex distributed RL environments, validating both core logic and multi-backend integration (sglang vs vllm).

The system encompasses unit tests for core modules, integration tests for full training workflows, and automated CI/CD pipelines that provision hardware-accelerated runners on GCP.


Test Organization and Markers

AReaL uses pytest as its primary testing framework. The test suite is designed to verify numerical consistency, loss convergence, and distributed communication.

Pytest Markers

The codebase utilizes custom markers to categorize tests based on resource requirements and execution time. These markers allow for selective execution in different environments (e.g., local dev vs. CI).

| Marker | Description |
| --- | --- |
| slow | Tests that take more than 30 seconds to run. Excluded from CI unless also marked with ci. |
| ci | Identifies slow tests that are mandatory for the CI/CD workflow. |
| gpu | Tests requiring a single GPU. |
| multi_gpu | Tests requiring more than one GPU. |
| asyncio | Used for testing asynchronous components such as the Inference Data Proxy or TrainController. |
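
The marker semantics in the table can be sketched as a small conftest.py fragment. This is a minimal illustration, not AReaL's actual configuration: the MARKERS dictionary and the selected_for_ci helper are hypothetical names, and the asyncio marker is left out on the assumption that it is supplied by the pytest-asyncio plugin. Only config.addinivalue_line is a real pytest API here.

```python
# Hypothetical conftest.py sketch; descriptions paraphrase the marker table.
MARKERS = {
    "slow": "takes more than 30 seconds; skipped in CI unless also marked ci",
    "ci": "slow test that must still run in CI",
    "gpu": "requires a single GPU",
    "multi_gpu": "requires more than one GPU",
}


def pytest_configure(config):
    """Register custom markers so pytest does not warn about unknown marks."""
    for name, description in MARKERS.items():
        config.addinivalue_line("markers", f"{name}: {description}")


def selected_for_ci(marks):
    """Mirror the marker expression `not slow or ci` for a set of mark names."""
    return "slow" not in marks or "ci" in marks
```

Under these assumptions, a CI job would select tests with `pytest -m "not slow or ci"`, while a local run could use `pytest -m "not slow and not gpu"` to stay on CPU-only fast tests.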

Scheduler and Controller Testing

Specialized tests exist for the distributed orchestration layer. For instance, tests/test_local_scheduler.py uses extensive mocking of subprocess.Popen and requests to validate worker lifecycle management, GPU allocation, and error handling without requiring actual hardware (tests/test_local_scheduler.py:44-188).
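
The mocking approach described above can be illustrated with the standard library alone. The sketch below is not taken from tests/test_local_scheduler.py: launch_worker is a hypothetical helper standing in for the scheduler's process-launch path, and the test only shows how patching subprocess.Popen lets GPU assignment be checked without spawning a process.

```python
import os
import subprocess
from unittest import mock


def launch_worker(cmd, gpu_ids):
    """Hypothetical helper: start a worker pinned to the given GPUs."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=",".join(str(g) for g in gpu_ids))
    proc = subprocess.Popen(cmd, env=env)
    return proc.pid


def test_launch_worker_sets_gpu_env():
    # Replace Popen so that no real process is spawned.
    with mock.patch("subprocess.Popen") as popen:
        popen.return_value.pid = 4321
        pid = launch_worker(["python", "-m", "worker"], gpu_ids=[0, 1])

    assert pid == 4321
    popen.assert_called_once()
    args, kwargs = popen.call_args
    assert args[0] == ["python", "-m", "worker"]
    assert kwargs["env"]["CUDA_VISIBLE_DEVICES"] == "0,1"
```

Because the patch is scoped to the `with` block, the real subprocess.Popen is restored as soon as the test body finishes, so the test needs neither a GPU nor a worker binary.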

Similarly, tests/test_train_controller.py uses a MockScheduler and MockTrainEngine to verify that the TrainController correctly identifies Data Parallel (DP) heads, manages worker environments, and handles distributed batch operations (tests/test_train_controller.py:38-105).
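
The same idea works with hand-written test doubles instead of patching. The sketch below assumes a simplified interface: FakeEngine, FakeScheduler, find_dp_heads, and the is_data_parallel_head method are all hypothetical stand-ins, not the MockScheduler or MockTrainEngine classes from the test file. It only shows the shape of a test that checks DP-head discovery against in-memory fakes.

```python
from dataclasses import dataclass, field


@dataclass
class FakeEngine:
    """Stand-in for a remote train engine (hypothetical interface)."""
    rank: int
    dp_head: bool

    def is_data_parallel_head(self) -> bool:
        return self.dp_head


@dataclass
class FakeScheduler:
    """Stand-in scheduler that returns pre-built engines instead of processes."""
    engines: list
    created: list = field(default_factory=list)

    def create_workers(self, role: str):
        self.created.append(role)  # record the request for later assertions
        return self.engines


def find_dp_heads(scheduler, role="train"):
    """Return the ranks of workers that report themselves as DP heads."""
    workers = scheduler.create_workers(role)
    return [w.rank for w in workers if w.is_data_parallel_head()]


def test_find_dp_heads():
    # Four workers; assume for illustration that even ranks lead their DP group.
    engines = [FakeEngine(rank=r, dp_head=(r % 2 == 0)) for r in range(4)]
    scheduler = FakeScheduler(engines)
    assert find_dp_heads(scheduler) == [0, 2]
    assert scheduler.created == ["train"]
```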

Sources: tests/test_local_scheduler.py:44-188, tests/test_train_controller.py:38-105


Continuous Integration (CI) Pipelines

AReaL employs complex GitHub Actions workflows to manage testing across different hardware backends and installation methods.

Multi-Variant Testing

The CI system dynamically determines test variants based on PR labels (safe-to-test) or manual inputs (.github/workflows/test-areal.yml:46-60). It supports testing both the sglang and vllm variants of the runtime image.

GCP Runner Provisioning

For tests requiring GPUs, the workflow provisions ephemeral GCP instances, e.g., a2-highgpu-2g (.github/workflows/test-areal.yml:78).

  1. Startup Script: A template script installs Docker, pulls the specified areal-runtime image, and registers a self-hosted GitHub runner inside the container (.github/workflows/test-areal.yml:124-156).
  2. Containerization: The runner executes within a privileged container with --runtime=nvidia, --net=host, and large shared memory (--shm-size) to support high-performance distributed training (.github/workflows/test-areal.yml:141-152).
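
The container flags listed above can be assembled as in the following sketch. It is an illustration rather than the workflow's startup script: the function name, container name, image reference, and the 32g shared-memory value are assumptions, while --privileged, --runtime=nvidia, --net=host, and --shm-size are the flags named in the text.

```python
import shlex


def runner_container_cmd(image, shm_size="32g", name="areal-ci-runner"):
    """Build a `docker run` argument list for a GPU-enabled CI runner.

    `shm_size` and `name` are illustrative defaults, not values taken
    from the AReaL workflow.
    """
    return [
        "docker", "run", "--detach",
        "--name", name,
        "--privileged",            # full device access inside the container
        "--runtime=nvidia",        # expose GPUs through the NVIDIA runtime
        "--net=host",              # share the host network stack
        f"--shm-size={shm_size}",  # enlarge /dev/shm for inter-process tensors
        image,
    ]


if __name__ == "__main__":
    # Hypothetical image reference; the real tag comes from the workflow inputs.
    print(shlex.join(runner_container_cmd("example.invalid/areal-runtime:sglang")))
```

Returning a list rather than a shell string keeps the command safe to pass to subprocess.run without quoting issues.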

Image Baking

To reduce CI latency, a "Bake" workflow creates GCP OS images with pre-pulled Docker images (.github/workflows/bake-gcp-image.yml:1-32). This workflow handles GHCR authentication and ensures that both the sglang and vllm variants are cached on disk (.github/workflows/bake-gcp-image.yml:103-114).

Sources: .github/workflows/test-areal.yml:46-160, .github/workflows/bake-gcp-image.yml:1-114


Installation Validation

AReaL provides dedicated scripts and CI jobs to validate the complex dependency tree required for distributed RL.

Automated Import Verification

The install-test.yml workflow verifies that core modules like TrainController, RolloutController, and WorkflowExecutor can be imported across different operating systems (Ubuntu, macOS) and Python versions (.github/workflows/install-test.yml:61-72).
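
An import smoke check of this kind can be sketched with importlib. The helper below is a generic illustration, not the workflow's actual step: check_imports is a hypothetical name, and the demo uses standard-library module names because the full module paths of the AReaL classes are not given here.

```python
import importlib


def check_imports(module_names):
    """Try to import each module and return {name: error message} for failures."""
    failures = {}
    for name in module_names:
        try:
            importlib.import_module(name)
        except Exception as exc:  # ImportError plus errors raised at import time
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures


if __name__ == "__main__":
    # Placeholder module names; a real check would list the package's own modules.
    problems = check_imports(["json", "no_such_module_for_demo"])
    for module, error in problems.items():
        print(f"FAILED {module}: {error}")
    raise SystemExit(1 if problems else 0)
```

Collecting every failure before exiting, rather than stopping at the first one, gives a single CI run the full list of broken imports on a given OS and Python version.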

Docker and CUDA Validation

  • Variant Consistency: Ensures that the vllm variant is tested using its specific pyproject.vllm.toml and uv.vllm.lock files (.github/workflows/install-test.yml:113-117).
  • Validation Tool: The script areal/tools/validate_docker_installation.py is executed within the runtime container to perform deep checks on CUDA-dependent packages (.github/workflows/install-test.yml:180-182).
  • Installation Validation: The tool areal/tools/validate_installation.py provides a general-purpose environment check.

Sources: .github/workflows/install-test.yml:61-182, .github/workflows/test-areal.yml:75


Test Infrastructure Architecture

The following diagrams bridge the gap between high-level testing concepts and the code entities that implement them.

Figure: "Mocking Strategy for Scheduler and Controller Tests" (test execution and mocking space)


Sources: tests/test_local_scheduler.py:30-101, tests/test_train_controller.py:38-150, areal/infra/rpc/rpc_server.py:33-62

Figure: "GCP CI Infrastructure Pipeline" (CI/CD provisioning flow)


Sources: .github/workflows/test-areal.yml:78-156, .github/workflows/bake-gcp-image.yml:39-59, .github/workflows/install-test.yml:180-182


Code Quality and Environment Standards

AReaL enforces strict code quality through pre-commit hooks and ruff linting.

Environment Management

The project uses uv for dependency management, maintaining separate lock files for the sglang and vllm backends to ensure reproducible test environments (.github/workflows/install-test.yml:46-48).

Continuous Validation

Sources: .github/workflows/install-test.yml:46-80, .github/workflows/runner-heartbeat.yml:1-103, .github/workflows/stale-issues.yml:1-33