
URL: https://deepwiki.com/inclusionAI/AReaL/4.5-weight-update-protocols

Weight Update Protocols | inclusionAI/AReaL | DeepWiki


Last indexed: 7 May 2026 (2e12c1)

Weight Update Protocols

Overview

Weight update protocols define how trained model weights are synchronized from training engines to inference engines during online RL training. The inference engines must receive updated weights to generate rollouts with the current policy, which is critical for on-policy algorithms like PPO and for controlling staleness in asynchronous RL.

AReaL supports two primary synchronization protocols: XCCL (GPU-direct via NCCL/XCCL) and disk-based (filesystem serialization). Additionally, AReaL features an experimental Weight Update Gateway (AWEX) for managing these transfers via a centralized controller.


Sources: areal/api/engine_api.py:173-181, areal/engine/sglang_remote.py:128-186, areal/engine/vllm_remote.py:126-209

Weight Update Modes

AReaL supports three weight update modes (the two primary protocols plus the experimental AWEX gateway), selected with the weight_update_mode parameter (e.g., actor.weight_update_mode in YAML).

| Mode | Description | Use Case | Requirements |
| --- | --- | --- | --- |
| disk | Serialize weights to shared filesystem; inference engines load from checkpoint. | Development, debugging, heterogeneous clusters. | Shared NFS/Lustre filesystem. |
| xccl | GPU-direct transfer via NCCL/XCCL broadcast. | Production, high-throughput training. | High-speed GPU interconnect (NVLink, InfiniBand). |
| awex | Experimental centralized weight exchange gateway. | Advanced orchestration and heterogeneous sharding. | Gateway service deployment and AWEX adapters. |
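
As a rough illustration of how a launcher might read and validate this setting, the sketch below looks up actor.weight_update_mode in an already-parsed config dictionary. The helper name resolve_weight_update_mode and the dictionary layout are hypothetical and are not taken from AReaL; only the three mode names come from the table above.

```python
VALID_MODES = ("disk", "xccl", "awex")


def resolve_weight_update_mode(config: dict, section: str = "actor") -> str:
    """Return the configured mode, failing early on a missing or unknown value.

    Hypothetical helper: AReaL's real config loading is not shown on this page.
    """
    try:
        mode = config[section]["weight_update_mode"]
    except KeyError as exc:
        raise KeyError(f"{section}.weight_update_mode is not set") from exc
    if mode not in VALID_MODES:
        raise ValueError(f"unknown weight_update_mode {mode!r}; expected one of {VALID_MODES}")
    return mode


# Equivalent of a YAML file containing `actor: {weight_update_mode: xccl}`.
config = {"actor": {"weight_update_mode": "xccl"}}
print(resolve_weight_update_mode(config))  # xccl
```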

Sources: examples/math/gsm8k_grpo_lora.yaml:82-84, areal/api/io_struct.py:183-186

Disk-Based Updates

In disk mode, the training engine serializes weights to a shared filesystem, and inference engines reload the checkpoint via HTTP endpoints. This mode is robust and works across heterogeneous environments but incurs I/O latency.
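
The two halves of this flow (write a checkpoint, then tell each inference engine to reload it) can be sketched with the standard library alone. This is a minimal simulation, not AReaL's code: FakeInferenceServer stands in for an HTTP endpoint, JSON stands in for a real checkpoint format, and the payload keys model_path and version are assumptions.

```python
import json
import tempfile
from pathlib import Path


def save_checkpoint(weights: dict, shared_dir: Path, version: int) -> Path:
    """Trainer side: write weights into a versioned directory on the shared filesystem."""
    ckpt_dir = shared_dir / f"weight_update_v{version}"
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    tmp = ckpt_dir / "weights.json.tmp"
    tmp.write_text(json.dumps(weights))
    # Rename into place so a reader never observes a partially written file.
    tmp.replace(ckpt_dir / "weights.json")
    return ckpt_dir


class FakeInferenceServer:
    """Stand-in for an inference engine; a real one would handle this over HTTP."""

    def __init__(self) -> None:
        self.weights: dict = {}
        self.version = -1

    def handle_update_from_disk(self, payload: dict) -> dict:
        ckpt_file = Path(payload["model_path"]) / "weights.json"
        self.weights = json.loads(ckpt_file.read_text())
        self.version = payload["version"]
        return {"success": True}


def disk_update(weights: dict, shared_dir: Path, version: int, servers: list) -> list:
    """Serialize once, then ask every inference engine to reload from the same path."""
    ckpt_dir = save_checkpoint(weights, shared_dir, version)
    payload = {"model_path": str(ckpt_dir), "version": version}
    return [server.handle_update_from_disk(payload) for server in servers]


with tempfile.TemporaryDirectory() as shared:
    servers = [FakeInferenceServer(), FakeInferenceServer()]
    replies = disk_update({"w": [1.0, 2.0]}, Path(shared), version=3, servers=servers)
    print(replies, servers[0].version)
```

The I/O latency mentioned above shows up here as one write by the trainer plus one read per inference engine, all against the same shared path.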

Protocol Flow: [Diagram: Disk-Based Weight Synchronization]



Sources: areal/engine/sglang_remote.py:128-159, areal/engine/vllm_remote.py:126-146, areal/api/io_struct.py:202-213

XCCL (GPU-Direct) Updates

XCCL mode uses NCCL/XCCL collective operations to broadcast weights directly from training GPU memory to inference GPU memory, bypassing CPU and filesystem entirely.

Protocol Flow: [Diagram: XCCL GPU-Direct Weight Synchronization]



Sources: areal/engine/sglang_remote.py:160-203, areal/engine/vllm_remote.py:147-209, areal/api/engine_api.py:618-666

Centralized Weight Update Gateway (AWEX - Experimental)

AReaL introduces an experimental microservice-based architecture for managing weight updates through the AWEX (Asynchronous Weight EXchange) system. This system handles complex parameter mapping between different parallel strategies (e.g., FSDP to TP).
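
To make the "FSDP to TP" mapping concrete, the toy sketch below reshards one weight using plain Python lists. It is not AWEX's implementation: it assumes FSDP stores a parameter as equal, zero-padded flat chunks and that the TP layout splits the matrix by contiguous rows, and all function names are hypothetical.

```python
import math


def fsdp_shard(flat: list, world_size: int) -> list:
    """Split a flat parameter into equal chunks, zero-padding the tail."""
    chunk = math.ceil(len(flat) / world_size)
    padded = flat + [0.0] * (chunk * world_size - len(flat))
    return [padded[i * chunk:(i + 1) * chunk] for i in range(world_size)]


def gather_full(shards: list, numel: int) -> list:
    """Concatenate the per-rank chunks and drop the padding."""
    flat = [x for shard in shards for x in shard]
    return flat[:numel]


def split_rows_for_tp(flat: list, rows: int, cols: int, tp_size: int) -> list:
    """Reshape to (rows, cols) and give each TP rank a contiguous block of rows."""
    if rows * cols != len(flat):
        raise ValueError("shape does not match the number of elements")
    if rows % tp_size != 0:
        raise ValueError("rows must be divisible by tp_size")
    matrix = [flat[r * cols:(r + 1) * cols] for r in range(rows)]
    per_rank = rows // tp_size
    return [matrix[i * per_rank:(i + 1) * per_rank] for i in range(tp_size)]


full = [float(i) for i in range(8)]        # a 4x2 weight, flattened row-major
shards = fsdp_shard(full, world_size=3)    # training side: 3 FSDP ranks
tp_shards = split_rows_for_tp(gather_full(shards, len(full)), rows=4, cols=2, tp_size=2)
print(tp_shards[1])  # [[4.0, 5.0], [6.0, 7.0]]
```

A real system would also have to pick the split axis per layer and avoid materializing the full weight on a single device; the sketch only shows why a mapping step is needed when the two sides shard differently.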

Controller and Gateway

The GatewayTrainController manages the lifecycle of the services and coordinates "pairs" of training and inference workers (areal/experimental/training_service/controller/controller.py:29-58).

[Diagram: AWEX Weight Update Architecture]


AWEX Adapters

Adapters translate internal engine states into a unified AWEX format.


Sources: areal/experimental/weight_update/awex/megatron_adapter.py:35-110, areal/experimental/weight_update/awex/sglang_adapter.py:38-110, areal/experimental/training_service/controller/controller.py:29-102

Backend Implementations

SGLang Backend

SGLangBackend maps updates to SGLang's HTTP endpoints. Note that SGLang distributed (XCCL) updates do not currently support LoRA (areal/engine/sglang_remote.py:168-172).

| Operation | Endpoint | Implementation |
| --- | --- | --- |
| Disk Update | /update_weights_from_disk | areal/engine/sglang_remote.py:151-157 |
| LoRA Load | /load_lora_adapter | areal/engine/sglang_remote.py:140-144 |
| XCCL Update | /update_weights_from_distributed | areal/engine/sglang_remote.py:176-184 |
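
The routing in this table, including the LoRA restriction, might look roughly like the sketch below. Only the three endpoint paths and the "no LoRA over XCCL" rule come from this page; the function name, the HttpRequest container, and the payload keys are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HttpRequest:
    endpoint: str
    payload: dict


def build_sglang_request(
    update_type: str,
    *,
    use_lora: bool = False,
    path: Optional[str] = None,
    lora_name: Optional[str] = None,
    group_name: Optional[str] = None,
) -> HttpRequest:
    """Pick the SGLang endpoint for one weight update (hypothetical helper)."""
    if update_type == "disk":
        if path is None:
            raise ValueError("disk updates need a checkpoint path")
        if use_lora:
            # Payload keys are assumed, not confirmed by this page.
            return HttpRequest("/load_lora_adapter", {"lora_name": lora_name, "lora_path": path})
        return HttpRequest("/update_weights_from_disk", {"model_path": path})
    if update_type == "xccl":
        if use_lora:
            raise NotImplementedError("SGLang distributed (XCCL) updates do not support LoRA")
        return HttpRequest("/update_weights_from_distributed", {"group_name": group_name})
    raise ValueError(f"unsupported update type: {update_type!r}")


print(build_sglang_request("disk", path="/shared/ckpt/v3").endpoint)
# /update_weights_from_disk
```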

vLLM Backend

VLLMBackend supports both full model and LoRA updates via disk and XCCL. For XCCL, vLLM uses a two-step process: the backend first sends the update metadata, then triggers the transfer (areal/engine/vllm_remote.py:156-157).

| Operation | Endpoint | Implementation |
| --- | --- | --- |
| Disk Update | /areal_update_weights | areal/engine/vllm_remote.py:144 |
| XCCL Meta | /areal_set_update_weight_meta | areal/engine/vllm_remote.py:184 |
| XCCL Trigger | /areal_update_weights_xccl | areal/engine/vllm_remote.py:185 |
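
The ordering constraint of the XCCL path (metadata first, trigger second) can be shown with a small sketch that takes the HTTP transport as a parameter. The two endpoint paths come from the table above; the function name, the metadata keys, and the {"success": ...} reply shape are assumptions.

```python
from typing import Callable

META_ENDPOINT = "/areal_set_update_weight_meta"
TRIGGER_ENDPOINT = "/areal_update_weights_xccl"


def xccl_update(post: Callable[[str, dict], dict], meta_payload: dict) -> None:
    """Send metadata, then trigger the transfer only if the metadata was accepted."""
    reply = post(META_ENDPOINT, meta_payload)
    if not reply.get("success", False):
        raise RuntimeError(f"metadata step rejected: {reply}")
    reply = post(TRIGGER_ENDPOINT, {})
    if not reply.get("success", False):
        raise RuntimeError(f"trigger step failed: {reply}")


calls = []


def fake_post(endpoint: str, payload: dict) -> dict:
    """Test double that records the call order instead of doing network I/O."""
    calls.append(endpoint)
    return {"success": True}


# Metadata keys below are illustrative, not the actual AReaL wire format.
xccl_update(fake_post, {"names": ["layer.weight"], "shapes": [[4, 2]], "dtypes": ["bfloat16"]})
print(calls)  # ['/areal_set_update_weight_meta', '/areal_update_weights_xccl']
```

Splitting the exchange this way lets the receiver learn which tensors to expect before any collective operation starts.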

Sources: areal/engine/sglang_remote.py:128-203, areal/engine/vllm_remote.py:126-209

LoRA Adapter Updates

AReaL manages LoRA updates using versioned adapter names to allow continuous policy updates without base model reloading.

Versioned Naming: Adapters are named using the pattern {lora_name}-v{version} via get_versioned_lora_name (areal/api/io_struct.py:161-163).
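
A one-line reimplementation of the naming pattern is shown below. The function name and the pattern come from this page; the exact signature in areal/api/io_struct.py is not shown here, so the parameters are an assumption.

```python
def get_versioned_lora_name(lora_name: str, version: int) -> str:
    """Build an adapter name following the {lora_name}-v{version} pattern."""
    return f"{lora_name}-v{version}"


print(get_versioned_lora_name("gsm8k-lora", 3))  # gsm8k-lora-v3
```

Because each policy version gets a distinct adapter name, a new adapter can be loaded without a name collision with the previous one.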

vLLM LoRA XCCL Flow: For vLLM, XCCL LoRA updates require passing lora_int_id, lora_target_modules, lora_rank, lora_alpha, and lora_bias in the metadata phase (areal/engine/vllm_remote.py:166-175). The backend builds a request to /areal_set_update_weight_meta_lora, followed by the trigger at /areal_update_weights_lora_xccl (areal/engine/vllm_remote.py:179-180).
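
The LoRA variant of the metadata-then-trigger exchange could be assembled as in the sketch below. The five metadata field names and the two endpoint paths come from the paragraph above; the function name, the value types (for example lora_bias as a string), and the empty trigger payload are assumptions.

```python
def build_lora_xccl_requests(
    *,
    lora_int_id: int,
    lora_target_modules: list,
    lora_rank: int,
    lora_alpha: int,
    lora_bias: str,
) -> list:
    """Return the (endpoint, payload) pairs in the order they should be sent."""
    meta = {
        "lora_int_id": lora_int_id,
        "lora_target_modules": list(lora_target_modules),
        "lora_rank": lora_rank,
        "lora_alpha": lora_alpha,
        "lora_bias": lora_bias,
    }
    return [
        ("/areal_set_update_weight_meta_lora", meta),  # step 1: describe the adapter
        ("/areal_update_weights_lora_xccl", {}),       # step 2: trigger the transfer
    ]


requests_to_send = build_lora_xccl_requests(
    lora_int_id=1,
    lora_target_modules=["q_proj", "v_proj"],
    lora_rank=16,
    lora_alpha=32,
    lora_bias="none",
)
for endpoint, payload in requests_to_send:
    print(endpoint, sorted(payload))
```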

Sources: areal/api/io_struct.py:161-163, areal/engine/vllm_remote.py:162-178

Weight Update Metadata

The WeightUpdateMeta class encapsulates all necessary information for a synchronization event.

| Field | Type | Description |
| --- | --- | --- |
| type | Literal["disk", "xccl", "awex"] | The synchronization protocol to use. |
| path | `str \| None` | |
| use_lora | bool | Whether this update is for a LoRA adapter. |
| version | `int \| None` | |
| nccl_group_name | `str \| None` | |
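
For orientation, the fields in this table could be modeled as the dataclass below. This is a simplified stand-in, not the actual WeightUpdateMeta class: the defaults and the validation rules in __post_init__ are assumptions, and the real class may carry additional fields.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass
class WeightUpdateMetaSketch:
    """Illustrative stand-in for WeightUpdateMeta; defaults are assumed."""

    type: Literal["disk", "xccl", "awex"]
    path: Optional[str] = None
    use_lora: bool = False
    version: Optional[int] = None
    nccl_group_name: Optional[str] = None

    def __post_init__(self) -> None:
        # Literal is not enforced at runtime, so check the value explicitly.
        if self.type not in ("disk", "xccl", "awex"):
            raise ValueError(f"unknown update type: {self.type!r}")
        # Assumed rule: a disk update is meaningless without a checkpoint path.
        if self.type == "disk" and self.path is None:
            raise ValueError("disk updates require a path")


meta = WeightUpdateMetaSketch(type="disk", path="/shared/ckpt/v3", version=3)
print(meta.type, meta.version)  # disk 3
```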

Sources: areal/api/io_struct.py:183-201