URL: https://deepwiki.com/inclusionAI/AReaL/15.2-installation-validation

Installation Validation | inclusionAI/AReaL | DeepWiki


Last indexed: 7 May 2026 (2e12c1)

Installation Validation

This page documents the installation validation infrastructure that ensures AReaL is correctly installed and configured in both local and CI environments. The validation system performs environment checks, dependency verification, and basic functionality tests before training operations begin.

For information about the broader testing infrastructure, see 15.1 Test Infrastructure.

Purpose and Scope

Installation validation serves several critical functions:

  1. Environment Verification: Confirms that the Docker runtime environment has all required dependencies and configurations.
  2. Pre-Training Checks: Validates the setup before expensive training operations begin.
  3. CI Quality Gates: Provides automated checks in GitHub Actions workflows to catch installation issues early.
  4. User Diagnostics: Offers a tool for users to verify their local installation.

The validation system operates at the boundary between system setup and actual training execution, ensuring that common configuration errors are caught before resource-intensive operations commence.

Validation Script Architecture

The validation infrastructure is built on a modular class hierarchy. BaseInstallationValidator provides core logic for parsing pyproject.toml and checking standard dependencies, while DockerInstallationValidator extends this for specialized deep-learning environments.

Class Hierarchy and Key Functions

| Class / Function | File Path | Role |
| --- | --- | --- |
| BaseInstallationValidator | areal/tools/validation_base.py:29 | Base class for dependency and version validation. |
| DockerInstallationValidator | areal/tools/validate_docker_installation.py:24 | Specialized validator for Docker environments with CUDA extensions. |
| parse_pyproject | areal/tools/validation_base.py:100-101 | Extracts dependency requirements from pyproject.toml. |
| test_cuda_functionality | areal/tools/validate_docker_installation.py:176-177 | Executes functional tests for FP8 (Transformer Engine) and Fused Optimizers (Apex). |

Sources: areal/tools/validation_base.py:29-160, areal/tools/validate_docker_installation.py:24-177
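The split between the two classes can be illustrated with a minimal sketch. Only the class names and parse_pyproject come from this page; the method signatures, the extra_packages hook, and the parsing rules below are assumptions made for illustration and are not taken from the AReaL source.

```python
import re

try:
    import tomllib  # standard library in Python 3.11+
except ModuleNotFoundError:  # older interpreters: fall back to the tomli backport
    import tomli as tomllib


class BaseInstallationValidator:
    """Hypothetical sketch of a base validator, not the AReaL implementation."""

    def parse_pyproject(self, text: str) -> dict[str, str]:
        """Map each name in [project].dependencies to its version specifier."""
        data = tomllib.loads(text)
        parsed: dict[str, str] = {}
        for requirement in data.get("project", {}).get("dependencies", []):
            # Drop environment markers such as "; python_version >= '3.11'".
            spec = requirement.split(";", 1)[0].strip()
            match = re.match(r"[A-Za-z0-9._-]+", spec)
            if not match:
                continue
            # Strip an optional extras group such as "[default]".
            rest = re.sub(r"^\[[^\]]*\]", "", spec[match.end():]).strip()
            parsed[match.group(0)] = rest
        return parsed

    def extra_packages(self) -> list[str]:
        """Hook (assumed) for subclasses to add environment-specific packages."""
        return []


class DockerInstallationValidator(BaseInstallationValidator):
    """Hypothetical subclass adding packages built into the Docker image."""

    def extra_packages(self) -> list[str]:
        return ["apex", "transformer_engine"]
```

In this sketch the subclass only overrides a hook, so the pyproject.toml parsing stays in one place and is shared by every environment-specific validator.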

Validation Workflow

The following diagram maps the validation steps, described in natural language, to the corresponding code entities in validate_docker_installation.py and validation_base.py.

Diagram: Installation Validation Logic Flow


Sources: areal/tools/validation_base.py:100-196, areal/tools/validate_docker_installation.py:75-178

Docker Environment Validation

The Docker runtime environment requires validation of specialized CUDA kernels and libraries that are often compiled from source during the image build process.

Specialized CUDA Submodules

The validator checks for the existence of specific C++ extensions and submodules that are critical for performance:

| Package | Submodules Validated | Purpose | Source |
| --- | --- | --- | --- |
| flash-attn | flash_attn_2_cuda | High-performance attention kernels | areal/tools/validation_base.py:65 |
| megatron-core | megatron.core.parallel_state, megatron.core.tensor_parallel | Distributed training state management | areal/tools/validate_docker_installation.py:36-39 |
| transformer_engine | transformer_engine.pytorch | FP8 training support | areal/tools/validate_docker_installation.py:32 |
| apex | apex.optimizers, apex.normalization | Fused CUDA kernels for optimizers | areal/tools/validate_docker_installation.py:31 |
| deep_ep | deep_ep | Expert Parallelism for DeepSeek-V3 MoE | areal/tools/validate_docker_installation.py:43 |
| flash_mla | flash_mla | Multi-head Latent Attention (SM90+) | areal/tools/validate_docker_installation.py:41 |
| deep_gemm | deep_gemm | FP8 GEMM library (SM90+) | areal/tools/validate_docker_installation.py:42 |
| fla | fla.ops, fla.layers, fla.modules | Linear attention with Triton kernels | areal/tools/validate_docker_installation.py:44 |
| causal_conv1d | causal_conv1d_cuda | Mamba/SSM dependency | areal/tools/validate_docker_installation.py:45 |

Sources: areal/tools/validate_docker_installation.py:28-46, areal/tools/validation_base.py:61-70
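A submodule check of this kind can be sketched with importlib. The function name check_submodules and its return shape are hypothetical; the sketch only shows the general technique of importing each listed submodule and recording the ones that fail.

```python
import importlib


def check_submodules(required: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return, per package, the submodules that could not be imported."""
    missing: dict[str, list[str]] = {}
    for package, modules in required.items():
        failed = []
        for module in modules:
            try:
                importlib.import_module(module)
            except Exception:
                # Compiled extensions can fail with errors other than
                # ImportError (for example OSError on an ABI mismatch).
                failed.append(module)
        if failed:
            missing[package] = failed
    return missing


# Example input using two rows from the table above.
REQUIRED = {
    "flash-attn": ["flash_attn_2_cuda"],
    "apex": ["apex.optimizers", "apex.normalization"],
}
```

Calling check_submodules(REQUIRED) on a machine without these extensions returns both packages with all their submodules listed; an empty dictionary means every import succeeded.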

Backend Variant Detection

AReaL supports multiple inference backends. The DockerInstallationValidator dynamically detects which backend is installed and verifies that the backends are mutually exclusive within the environment (areal/tools/validate_docker_installation.py:94-138).

Diagram: Inference Backend Validation Logic


Sources: areal/tools/validate_docker_installation.py:94-138
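The mutual-exclusion rule can be sketched as follows. This page does not name the backends, so the candidate names "sglang" and "vllm" are assumptions, as are the function name detect_backend and the choice to return None when no backend is found.

```python
import importlib.util
from typing import Callable, Optional, Sequence


def detect_backend(
    candidates: Sequence[str] = ("sglang", "vllm"),  # assumed backend names
    is_installed: Optional[Callable[[str], bool]] = None,
) -> Optional[str]:
    """Return the single installed backend, or None if none is installed.

    Raises RuntimeError when more than one candidate is importable, which
    is how this sketch models the mutual-exclusion requirement.
    """
    if is_installed is None:
        # find_spec locates a top-level module without importing it.
        is_installed = lambda name: importlib.util.find_spec(name) is not None
    found = [name for name in candidates if is_installed(name)]
    if len(found) > 1:
        raise RuntimeError(f"conflicting inference backends installed: {found}")
    return found[0] if found else None
```

Passing is_installed explicitly makes the logic testable without installing either backend, for example detect_backend(is_installed=lambda name: name == "vllm").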

Functional Validation

Beyond simple imports, the system performs functional tests on CUDA-dependent features to ensure that the compiled extensions are compatible with the hardware.

Transformer Engine FP8 Test

The test_cuda_functionality method attempts to initialize a te.Linear layer and run a forward/backward pass under te.fp8_autocast (areal/tools/validate_docker_installation.py:182-195). This validates that the transformer_engine C++ extensions are correctly linked and that the GPU supports the required FP8 operations.

Apex Fused Optimizer Test

The script validates apex.optimizers.FusedAdam by performing a weight update on a small tensor (areal/tools/validate_docker_installation.py:180-195). This ensures that the fused CUDA optimizer kernels are functional.

Sources: areal/tools/validate_docker_installation.py:176-195
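The two functional checks described above might look roughly like the sketch below. The function names, tensor shapes, and the (bool, message) return convention are assumptions; only te.Linear, te.fp8_autocast, and apex.optimizers.FusedAdam are named on this page. Both functions report a skip instead of raising when the libraries or a CUDA device are unavailable.

```python
def fp8_linear_smoke_test() -> tuple[bool, str]:
    """Run one FP8 forward/backward pass through a Transformer Engine layer."""
    try:
        import torch
        import transformer_engine.pytorch as te
    except Exception as exc:  # missing package or unloadable extension
        return False, f"skipped: {exc}"
    if not torch.cuda.is_available():
        return False, "skipped: no CUDA device"
    try:
        layer = te.Linear(64, 64)  # dimensions kept as multiples of 16 for FP8
        x = torch.randn(32, 64, device="cuda", requires_grad=True)
        with te.fp8_autocast(enabled=True):
            out = layer(x)
        out.sum().backward()
    except Exception as exc:  # e.g. GPU without FP8 support
        return False, f"failed: {exc}"
    return True, "ok"


def fused_adam_smoke_test() -> tuple[bool, str]:
    """Apply one FusedAdam step and confirm that the weights changed."""
    try:
        import torch
        from apex.optimizers import FusedAdam
    except Exception as exc:
        return False, f"skipped: {exc}"
    if not torch.cuda.is_available():
        return False, "skipped: no CUDA device"
    try:
        weight = torch.nn.Parameter(torch.ones(16, device="cuda"))
        optimizer = FusedAdam([weight], lr=0.1)
        weight.sum().backward()
        optimizer.step()
        changed = not torch.equal(weight.detach(), torch.ones(16, device="cuda"))
    except Exception as exc:
        return False, f"failed: {exc}"
    return changed, ("ok" if changed else "weights did not change")
```

Returning a status tuple rather than raising lets a caller run every check and report all failures together.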

Critical Package List

The validator maintains a list of CRITICAL_PACKAGES that must be present for the system to function. Failure to import any of these results in a non-zero exit code.

| Category | Packages |
| --- | --- |
| Core Training | torch, megatron-core, mbridge, flash-attn |
| Infrastructure | ray, hydra-core, omegaconf, wandb |
| Data/Models | transformers, datasets |
| Docker-Specific | grouped_gemm, apex, transformer_engine, causal_conv1d, megatron-bridge |

Sources: areal/tools/validation_base.py:74-90, areal/tools/validate_docker_installation.py:51-61
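The link between a failed critical import and the process exit code can be sketched as below. The function name critical_exit_code and the mapping from distribution names to import names are illustrative assumptions; the actual CRITICAL_PACKAGES structure is not shown on this page.

```python
import importlib
from typing import Mapping

# Distribution name -> import name. Several packages in the table are
# imported under a different name than the one they are installed as.
SAMPLE_CRITICAL = {
    "torch": "torch",
    "megatron-core": "megatron.core",
    "flash-attn": "flash_attn",
    "hydra-core": "hydra",
}


def critical_exit_code(packages: Mapping[str, str]) -> int:
    """Return 0 if every critical package imports, otherwise 1."""
    failures = []
    for dist_name, import_name in packages.items():
        try:
            importlib.import_module(import_name)
        except Exception as exc:
            failures.append((dist_name, type(exc).__name__))
    for dist_name, error in failures:
        print(f"CRITICAL: cannot import {dist_name} ({error})")
    return 1 if failures else 0
```

A command-line entry point would typically pass this value to sys.exit so that CI jobs fail when a critical package is missing.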

Integration and Examples Validation

Validation extends beyond static checks to dynamic execution of representative examples. The tests/test_examples.py file contains integration tests that run full training loops for short durations to validate the end-to-end stack.

Example Test Runner

The run_example function (tests/test_examples.py:26-32) executes AReaL training scripts in a subprocess and monitors the output for a success pattern such as "Train step 1/X done" (tests/test_examples.py:21).
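The subprocess-and-pattern approach can be sketched in a few lines. This is not the run_example implementation: the function name run_until_pattern, its parameters, and the regular expression are assumptions chosen to match the "Train step 1/X done" example.

```python
import re
import subprocess
import sys
import threading


def run_until_pattern(cmd: list[str], pattern: str, timeout_s: float = 600.0) -> bool:
    """Run cmd and return True as soon as a line of its output matches pattern."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    # Kill the child if it runs past the timeout; this also ends the read loop.
    killer = threading.Timer(timeout_s, proc.kill)
    killer.start()
    matched = False
    try:
        for line in proc.stdout:
            if re.search(pattern, line):
                matched = True
                break
    finally:
        killer.cancel()
        if proc.poll() is None:
            proc.terminate()  # stop the run early once the pattern is seen
        proc.stdout.close()
        proc.wait()
    return matched


if __name__ == "__main__":
    # Stand-in for a training script that prints a step-completion line.
    demo = [sys.executable, "-c", "print('Train step 1/5 done')"]
    print(run_until_pattern(demo, r"Train step 1/\d+ done"))
```

Stopping at the first match keeps the test short, since the goal is only to show that the stack can complete one training step.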

Validated Examples

The key examples used for installation validation are defined as test cases in tests/test_examples.py.

Sources: tests/test_examples.py:21-215

Usage and Diagnostics

Running Validation

The validation is typically executed via the CLI.


Installation Prerequisites

Users should ensure their hardware and software meet the minimum requirements before running validation. For multi-node setups, a shared storage path (NAS) is required for checkpointing and logs.

| Requirement | Recommended Version |
| --- | --- |
| NVIDIA Driver | 550.127.08 |
| CUDA | 12.8 |
| Docker | 27.5.1 |
| Python | 3.11 or 3.12 |

Sources: areal/tools/validation_base.py:16-26, areal/tools/validate_docker_installation.py:1-16
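A quick pre-check against the table above can be sketched as follows. The helper names are hypothetical, and treating the recommended driver, CUDA, and Docker versions as a lower bound is an assumption, since the table lists them as recommendations rather than strict minimums.

```python
import sys
from typing import Optional, Sequence

SUPPORTED_PYTHON = {(3, 11), (3, 12)}  # from the requirements table


def python_supported(version_info: Optional[Sequence[int]] = None) -> bool:
    """Check whether the major.minor Python version is 3.11 or 3.12."""
    major, minor = tuple(version_info or sys.version_info)[:2]
    return (major, minor) in SUPPORTED_PYTHON


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted version such as '550.127.08' into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


def meets_recommended(installed: str, recommended: str) -> bool:
    """Assumption: an installed version at or above the recommended one is fine."""
    return parse_version(installed) >= parse_version(recommended)
```

Comparing integer tuples avoids the string-comparison pitfall in which "12.10" would sort before "12.8".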