Last indexed: 7 May 2026 (2e12c1)

Docker Images

This document describes the Docker runtime images used by AReaL, including their multi-variant architecture, build process, CUDA package compilation, and usage in CI/CD workflows. The Docker images provide a pre-configured environment with all dependencies, optimized CUDA kernels, and GPU runtime capabilities for both training and inference.

Image Overview

AReaL provides Docker runtime images that package the complete execution environment. Unlike standard installations, the Docker images include pre-compiled C++ extensions and heavy CUDA dependencies that are often slow to build from source.

Key Components

Base OS: Ubuntu-based image with CUDA 12.9 support, inherited from the SGLang base image Dockerfile12
Python Environment: Python 3.11+ managed via uv with bytecode compilation enabled Dockerfile36-163 pyproject.toml10
Inference Backends: Mutually exclusive variants for sglang or vllm Dockerfile1-15
Compiled Extensions: Pre-installed flash-attn (v2 & v3), apex, TransformerEngine, DeepEP, DeepGEMM, and FlashMLA Dockerfile83-158
System Libraries: InfiniBand/RDMA support (libibverbs, librdmacm) plus system and build tools (kmod, cmake) Dockerfile22-30

Sources: Dockerfile1-163 pyproject.toml1-13 docs/en/tutorial/installation.md39-58

Multi-Variant Architecture

The AReaL Docker system uses a single Dockerfile to produce two distinct image variants. The VARIANT build argument selects which versions of PyTorch and which inference backend are installed. Two variants are necessary because sglang and vllm pin mutually incompatible torch and torchao versions pyproject.vllm.toml3-5 areal/tools/check_pyproject_consistency.py28-46

Variant Comparison

Feature	`sglang` Variant (Default)	`vllm` Variant
Base Image	`lmsysorg/sglang:v0.5.10.post1-runtime`	`lmsysorg/sglang:v0.5.10.post1-runtime`
PyTorch Version	`2.9.1+cu129` Dockerfile62	`2.10.0+cu129` Dockerfile62
Primary Backend	`sglang` pyproject.toml148	`vllm` pyproject.vllm.toml161
Flash Attention	Linked against Torch 2.9 Dockerfile124	Linked against Torch 2.10 Dockerfile124
Lock File	`uv.lock` scripts/uv_lock.sh9	`uv.vllm.lock` scripts/uv_lock.sh10
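
The variant table can be read as a small lookup keyed by the VARIANT build argument. The sketch below encodes only the values listed in the table. The `VariantSpec` dataclass, the `build_command` helper, and the default `areal-runtime:local` tag are hypothetical names chosen for illustration and are not taken from the AReaL repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantSpec:
    """Per-variant settings copied from the comparison table (hypothetical container)."""
    torch_version: str
    backend: str
    lock_file: str


VARIANTS = {
    "sglang": VariantSpec("2.9.1+cu129", "sglang", "uv.lock"),
    "vllm": VariantSpec("2.10.0+cu129", "vllm", "uv.vllm.lock"),
}


def build_command(variant: str = "sglang", tag: str = "areal-runtime:local") -> list[str]:
    """Return a `docker build` argv that sets the VARIANT build argument."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {sorted(VARIANTS)}")
    # The tag is an illustrative local name, not a published image.
    return ["docker", "build", "--build-arg", f"VARIANT={variant}", "-t", tag, "."]
```

Returning an argv list rather than a shell string keeps the command safe to pass to `subprocess.run` without quoting concerns.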

Natural Language to Code Entity Mapping: Build Variants

The following diagram shows how the VARIANT argument flows through the Dockerfile to configure the environment and how the build system selects dependencies.

Diagram: Build Variant Orchestration

Sources: Dockerfile1-15 Dockerfile61-64 Dockerfile121-128 pyproject.vllm.toml1-15 areal/tools/check_pyproject_consistency.py28-50

CUDA Package Compilation

A critical role of the Docker image is providing pre-compiled binaries for complex CUDA extensions. The build is ordered into stages so that Docker's layer cache is reused as much as possible.

Stage 1: Base Torch and Build Tools

The image first installs the variant-specific PyTorch version along with build tools such as cmake, ccache, and uv Dockerfile22-64

Stage 2: Heavy C++ Dependencies

Packages that take a long time to compile or that target specific GPU architectures (SM80, SM89, SM90) are installed before the main project code, so that changes to the project code do not invalidate these layers and trigger recompilation Dockerfile48-81

NVIDIA Apex: Compiled with APEX_CUDA_EXT=1 for fused kernels Dockerfile84-86
Transformer Engine: For FP8 training support Dockerfile89-90
DeepSeek-V3 Kernels:
- FlashMLA: Multi-head Latent Attention Dockerfile93-97
- DeepGEMM: FP8 GEMM library Dockerfile100-104
- DeepEP: Expert Parallelism communication library for MoE; SM90 features enabled via TORCH_CUDA_ARCH_LIST="9.0 9.0a" Dockerfile109-112
Flash Attention: Rather than compiling from source (about 30 minutes), the Dockerfile downloads, repacks, and installs pre-built wheels that match the variant's Torch version (2.9 or 2.10) Dockerfile121-158
Causal Conv1d: Required for models like Qwen-3.5 Dockerfile115-118
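
One way to confirm that the pre-compiled extensions listed above are visible inside a container is to probe the import path. This is a minimal sketch: the module names are assumptions based on each project's usual import name and are not read from the AReaL Dockerfile, and `find_missing` is a hypothetical helper.

```python
import importlib.util

# Import names are assumptions based on each project's usual module name;
# they are not read from the AReaL Dockerfile.
EXPECTED_MODULES = [
    "flash_attn",
    "apex",
    "transformer_engine",
    "deep_ep",
    "deep_gemm",
    "flash_mla",
    "causal_conv1d",
]


def find_missing(modules: list[str]) -> list[str]:
    """Return the top-level modules that cannot be located on the import path."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(EXPECTED_MODULES)
    print("missing:", ", ".join(missing) if missing else "none")
```

`find_spec` only locates a module without importing it, so this check does not prove that the compiled kernels load against the installed Torch and CUDA libraries. Importing each module on a GPU host is the stronger test.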

Sources: Dockerfile22-158 docs/en/tutorial/installation.md126-162

Image Tags and Versioning

Images are hosted on GitHub Container Registry (GHCR) at ghcr.io/inclusionai/areal-runtime docs/en/tutorial/installation.md27

Tag Structure

v1.0.4-sglang: Pinned version for the SGLang variant docs/en/tutorial/installation.md27
v1.0.4-vllm: Pinned version for the vLLM variant docs/en/tutorial/installation.md51
dev-sglang / dev-vllm: Development tags for testing Dockerfile6-7
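
The tag scheme above is a version (or `dev`) followed by the variant name. A minimal sketch of assembling a full image reference, where `image_ref` is a hypothetical helper and not part of AReaL:

```python
REGISTRY_IMAGE = "ghcr.io/inclusionai/areal-runtime"
VALID_VARIANTS = ("sglang", "vllm")


def image_ref(version: str, variant: str) -> str:
    """Build `<registry image>:<version>-<variant>` for a pinned or dev tag."""
    if variant not in VALID_VARIANTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {VALID_VARIANTS}")
    return f"{REGISTRY_IMAGE}:{version}-{variant}"


if __name__ == "__main__":
    print(image_ref("v1.0.4", "sglang"))
    print(image_ref("dev", "vllm"))
```

Pinned tags such as `v1.0.4-sglang` are the safer choice for reproducible runs; the `dev-*` tags are described above as intended for testing.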

Consistency Management

Because the project maintains two project files (pyproject.toml and pyproject.vllm.toml), a consistency checker, areal/tools/check_pyproject_consistency.py, ensures that non-inference dependencies remain identical across both variants areal/tools/check_pyproject_consistency.py6-9. The ESCAPABLE_PACKAGES list defines which packages (such as torch, sglang, and vllm) are permitted to differ areal/tools/check_pyproject_consistency.py32-46
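
The idea behind such a check can be sketched as a comparison of two dependency lists that skips an allow-list of packages. This is not the implementation in check_pyproject_consistency.py: the `ESCAPABLE` set below is an illustrative subset, and `package_name` and `inconsistent` are hypothetical helpers.

```python
import re

# Illustrative subset; the real ESCAPABLE_PACKAGES list lives in
# areal/tools/check_pyproject_consistency.py and may differ.
ESCAPABLE = {"torch", "torchao", "sglang", "vllm"}


def package_name(requirement: str) -> str:
    """Extract the normalized distribution name from a requirement string."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    if match is None:
        raise ValueError(f"cannot parse requirement: {requirement!r}")
    # Normalize runs of '-', '_' and '.' to a single '-' and lowercase the name.
    return re.sub(r"[-_.]+", "-", match.group(1)).lower()


def inconsistent(deps_a: list[str], deps_b: list[str]) -> set[str]:
    """Return non-escapable packages whose requirement strings differ or are missing."""
    a = {package_name(r): r.strip() for r in deps_a}
    b = {package_name(r): r.strip() for r in deps_b}
    names = (a.keys() | b.keys()) - ESCAPABLE
    return {name for name in names if a.get(name) != b.get(name)}
```

An empty result means the two lists agree on every package outside the allow-list. A package that appears in only one list is reported as well, since `a.get(name)` and `b.get(name)` then differ.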

Sources: docs/en/tutorial/installation.md27-58 areal/tools/check_pyproject_consistency.py1-50 pyproject.toml11

Runtime and Orchestration

System Mapping: Docker Runtime to Execution

The following diagram bridges the Docker container setup with the training environment configuration.

Diagram: Runtime Environment Mapping

Usage and Deployment

The Docker image provides everything needed to start Ray clusters or local training jobs without additional installation steps. It is the recommended way to run AReaL because it avoids conflicts with the local environment docs/en/tutorial/installation.md39-43
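
A minimal sketch of launching an interactive container from one of the published tags. The flags used are standard Docker CLI options rather than a command taken from the AReaL documentation, and `run_command`, the host directory, and the `/workspace` mount point are hypothetical choices for illustration.

```python
import shlex


def run_command(image: str, host_dir: str, workdir: str = "/workspace") -> list[str]:
    """Return a `docker run` argv for an interactive GPU container."""
    return [
        "docker", "run", "--rm", "-it",
        "--gpus", "all",                 # expose all GPUs (needs the NVIDIA Container Toolkit)
        "--ipc=host",                    # share the host IPC namespace for shared-memory use
        "-v", f"{host_dir}:{workdir}",   # mount a host directory into the container
        "-w", workdir,                   # start in the mounted directory
        image, "bash",
    ]


if __name__ == "__main__":
    cmd = run_command("ghcr.io/inclusionai/areal-runtime:v1.0.4-sglang", "/path/to/project")
    # Print a copy-pasteable shell command instead of executing it.
    print(shlex.join(cmd))
```

The function only builds and prints the command, so it can be inspected or adapted (for example, adding network or device flags for multi-node runs) before anything is executed.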

Sources: docs/en/tutorial/installation.md39-58 Dockerfile52-53 scripts/uv_lock.sh7-10

URL: https://deepwiki.com/inclusionAI/AReaL/12.3-docker-images