
URL: https://deepwiki.com/inclusionAI/AReaL/3.6-weight-synchronization

Weight Synchronization | inclusionAI/AReaL | DeepWiki


Last indexed: 7 May 2026 (2e12c1)

Weight Synchronization

Weight synchronization is the mechanism by which training engines propagate updated model parameters to inference engines during online reinforcement learning. This process ensures that rollout generation uses the most recent policy, enabling asynchronous RL workflows where training and inference occur on separate GPU pools.

For information about how training engines manage their own checkpoints and persistence, see 3.7 Checkpointing and Recovery. For details on inference engine lifecycle management, see 4.6 Server Lifecycle Management.

Overview

In AReaL's asynchronous RL architecture, training engines and inference engines operate independently on different GPU pools. After each training step, the training engine must propagate its updated weights to inference engines so that subsequent rollouts use the latest policy. This process is called weight synchronization or weight update.

The system supports three primary update modes:

  • XCCL mode: GPU-to-GPU transfer using collective communication (NCCL/HCCL).
  • Disk mode: File system-based transfer through checkpoint save/load.
  • AWEX mode: An optimized asynchronous weight exchange protocol (Experimental).

Weight synchronization is controlled by the weight_update_mode configuration parameter and orchestrated through the WeightUpdateMeta dataclass, which encapsulates all metadata required for the update operation areal/api/io_struct.py167-185

Sources: areal/api/io_struct.py167-211 areal/api/engine_api.py175-184

Weight Update Modes

XCCL Mode (GPU-to-GPU)

XCCL mode performs direct GPU-to-GPU weight transfer using PyTorch's distributed communication primitives (primarily dist.broadcast). Because the weights never pass through the filesystem, this avoids the save/load round trip of disk mode, but it requires establishing a shared process group between training and inference engines.

Key characteristics:

Sources: areal/api/io_struct.py168-174 areal/engine/vllm_remote.py147-186 areal/engine/vllm_ext/vllm_worker_extension.py133-165

Disk Mode (Filesystem-based)

Disk mode saves updated weights to a shared filesystem location and notifies inference engines to reload from disk via HTTP endpoints.

Key characteristics:

  • Robustness: No process group coordination required; works across heterogeneous clusters or when network topology prevents direct NCCL connections.
  • Implementation: Training engines save state dicts (e.g., via torch.save or save_model_to_hf) to the path specified in WeightUpdateMeta areal/api/io_struct.py169
  • Inference Reload: SGLangBackend builds requests for /update_weights_from_disk areal/engine/sglang_remote.py151-158. vLLMBackend uses /areal_update_weights for full model updates from disk areal/engine/vllm_remote.py141

Sources: areal/api/io_struct.py167-181 areal/engine/sglang_remote.py128-159 areal/engine/vllm_remote.py126-145
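
The reload notification for the SGLang backend can be pictured as a plain HTTP POST. The sketch below only builds the request object and does not send it. The helper name, the host and port, the checkpoint path, and the "model_path" key in the JSON body are assumptions made for illustration; the actual payload is whatever SGLangBackend constructs in areal/engine/sglang_remote.py.

```python
import json
import urllib.request


def build_disk_update_request(server_url: str, model_path: str) -> urllib.request.Request:
    """Build (but do not send) a reload request for one inference server.

    Hypothetical helper: the "model_path" body key is an assumed schema,
    not taken from the AReaL source.
    """
    payload = json.dumps({"model_path": model_path}).encode("utf-8")
    return urllib.request.Request(
        url=f"{server_url.rstrip('/')}/update_weights_from_disk",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Example values are placeholders for a shared-filesystem checkpoint.
req = build_disk_update_request("http://localhost:30000", "/shared/ckpt/step_10")
```

Sending the request (for example with urllib.request.urlopen) would happen only after the training engine has finished writing the checkpoint, since the server reads the files as soon as it is notified.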

WeightUpdateMeta Structure

The WeightUpdateMeta dataclass carries all metadata required for a weight update operation. It is created by the training engine and passed to both the training and inference engines during synchronization.

Field             Type                      Description
type              "disk", "xccl", "awex"    Weight update mode areal/api/io_struct.py168
path              str | None                Filesystem path for disk mode areal/api/io_struct.py169
nccl_group_name   str | None                Process group identifier for XCCL areal/api/io_struct.py174
use_lora          bool                      Whether updating LoRA adapters instead of full model areal/api/io_struct.py177
lora_name         str                       LoRA adapter identifier areal/api/io_struct.py178
version           int | None                Monotonically increasing version number areal/api/io_struct.py185

Sources: areal/api/io_struct.py167-185
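
To make the field list concrete, here is a simplified stand-in for the dataclass. It mirrors only the fields named in the table; the class name, the default values, and the validate() method are assumptions added for illustration and are not taken from areal/api/io_struct.py.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass
class WeightUpdateMetaSketch:
    """Simplified stand-in for WeightUpdateMeta; not the AReaL definition."""

    type: Literal["disk", "xccl", "awex"]
    path: Optional[str] = None             # used by disk mode
    nccl_group_name: Optional[str] = None  # used by XCCL mode
    use_lora: bool = False                 # default is an assumption
    lora_name: str = ""                    # default is an assumption
    version: Optional[int] = None

    def validate(self) -> None:
        # Hypothetical consistency checks, for illustration only.
        if self.type == "disk" and not self.path:
            raise ValueError("disk mode needs a filesystem path")
        if self.type == "xccl" and not self.nccl_group_name:
            raise ValueError("xccl mode needs a process group name")
        if self.use_lora and not self.lora_name:
            raise ValueError("LoRA updates need an adapter name")


# A disk-mode update for training step 10 (path is a placeholder).
disk_meta = WeightUpdateMetaSketch(type="disk", path="/shared/ckpt/step_10", version=10)
disk_meta.validate()
```

Keeping mode-specific fields optional lets one object describe any mode, at the cost of needing a check like validate() to catch combinations that make no sense.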

Training Engine Integration

Training engines implement the update_weights() and connect_engine() methods to manage weight synchronization with inference engines areal/api/engine_api.py175-194

Connection Establishment

Training engines initialize weight update groups. For example, the AwexMegatronAdapter uses init_weights_update_group to establish a shared communication channel for training ranks areal/experimental/weight_update/awex/megatron_adapter.py154-162

Code Entity Interaction: Connection


Sources: areal/api/engine_api.py185-194 areal/experimental/weight_update/awex/megatron_adapter.py128-162 areal/experimental/weight_update/nccl_group.py13-23

Weight Update Execution

The update_weights() method performs the actual weight transfer. In AwexMegatronAdapter, this involves building a transfer plan, preparing send operations via nccl_build_send_ops, and executing them with batch_send_recv areal/experimental/weight_update/awex/megatron_adapter.py173-179

Weight Update Logic Flow


Sources: areal/experimental/weight_update/awex/megatron_adapter.py163-180 areal/api/engine_api.py175-183

Version Management

Weight version tracking prevents training on stale rollouts and enables staleness-aware algorithms.

Version Tracking in Responses

Inference engines embed the current weight version in each generated response. The output_versions field in ModelResponse tracks the version of the model used to generate the tokens areal/api/io_struct.py68. Versioned LoRA names are generated using get_versioned_lora_name(lora_name, version) areal/api/io_struct.py161-163

Sources: areal/api/io_struct.py63-68 areal/api/io_struct.py161-163
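
One way a trainer might use these versions is to drop rollouts that lag too far behind the current policy. The sketch below assumes output_versions holds one version per generated token; ResponseSketch, is_fresh, and the max_staleness parameter are hypothetical names introduced here, not part of AReaL's API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ResponseSketch:
    """Hypothetical stand-in for ModelResponse, reduced to two fields."""

    output_tokens: List[int]
    # Assumed layout: the weight version that produced each token.
    output_versions: List[int] = field(default_factory=list)


def is_fresh(resp: ResponseSketch, current_version: int, max_staleness: int) -> bool:
    """Return True if the oldest token is at most max_staleness versions old."""
    if not resp.output_versions:
        return False  # no version information, treat as unusable
    oldest = min(resp.output_versions)
    return current_version - oldest <= max_staleness


batch = [
    ResponseSketch(output_tokens=[11, 12], output_versions=[3, 4]),
    ResponseSketch(output_tokens=[21, 22], output_versions=[2, 3]),
]
usable = [r for r in batch if is_fresh(r, current_version=5, max_staleness=2)]
```

Measuring staleness from the oldest token is the conservative choice: a response that straddles a weight update is judged by its earliest tokens.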

LoRA Support

AReaL provides native support for LoRA weight synchronization, allowing efficient updates of adapter layers without reloading the base model.

LoRA Update Protocol

LoRA updates require specific fields in WeightUpdateMeta:

Sources: areal/api/io_struct.py177-181

Backend Implementation

Sources: areal/engine/vllm_remote.py130-186 areal/engine/sglang_remote.py132-172 areal/engine/vllm_ext/vllm_worker_extension.py58-131 areal/experimental/weight_update/awex/sglang_adapter.py112-160

Memory Management During Updates

Model Sharding (Megatron/TP)

When using Megatron-style Tensor Parallelism, the system must all-gather sharded parameters before synchronization to the inference engine. all_gather_param handles gathering sharded tensors along the partition dimension to reconstruct full weights areal/engine/megatron_utils/megatron.py95-152

Sources: areal/engine/megatron_utils/megatron.py95-152 areal/engine/megatron_utils/megatron.py26-40
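
The reconstruction step amounts to concatenating each rank's shard along the dimension that was partitioned. The sketch below shows this with plain nested lists rather than tensors and without any collective communication; concat_shards is a hypothetical helper and is not the all_gather_param implementation.

```python
from typing import List

Matrix = List[List[float]]


def concat_shards(shards: List[Matrix], partition_dim: int) -> Matrix:
    """Rebuild a full 2-D weight from per-rank shards (illustrative only)."""
    if partition_dim == 0:
        # Shards hold disjoint row blocks: stack them in rank order.
        return [row for shard in shards for row in shard]
    if partition_dim == 1:
        # Shards hold disjoint column blocks: join matching rows.
        n_rows = len(shards[0])
        return [[x for shard in shards for x in shard[r]] for r in range(n_rows)]
    raise ValueError("partition_dim must be 0 or 1 for a 2-D weight")


rank0 = [[1.0, 2.0], [3.0, 4.0]]
rank1 = [[5.0, 6.0], [7.0, 8.0]]
full_rows = concat_shards([rank0, rank1], partition_dim=0)
full_cols = concat_shards([rank0, rank1], partition_dim=1)
```

The same shards yield different full weights depending on the partition dimension, which is why the gather has to know how each parameter was split.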

FP8 Support

For high-performance training, AReaL supports FP8 weight synchronization. _all_gather_fp8_tensor_and_concat handles the collective communication of both the rowwise data and the rowwise scale inversions for Float8BlockwiseQTensor types areal/engine/megatron_utils/megatron.py63-91

Sources: areal/engine/megatron_utils/megatron.py63-91 areal/engine/megatron_utils/megatron.py105-110
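
The reason both pieces must be gathered is that quantized values are meaningless without their scales. The sketch below is a deliberately simplified model: it uses one scale inverse per row instead of per block, integers in place of FP8 values, and no collective communication. The function names are hypothetical and do not reflect _all_gather_fp8_tensor_and_concat.

```python
from typing import List, Tuple

# One shard: (quantized rows, one scale inverse per row). Assumed layout.
Shard = Tuple[List[List[int]], List[float]]


def gather_and_concat(shards: List[Shard]) -> Tuple[List[List[int]], List[float]]:
    """Concatenate data and scale inverses in the same rank order."""
    data = [row for rows, _ in shards for row in rows]
    scale_inv = [s for _, scales in shards for s in scales]
    return data, scale_inv


def dequantize(data: List[List[int]], scale_inv: List[float]) -> List[List[float]]:
    """Recover approximate real values: quantized value times its row's scale inverse."""
    return [[q * s for q in row] for row, s in zip(data, scale_inv)]


shards = [([[2, 4]], [0.5]), ([[1, 3]], [2.0])]
data, scale_inv = gather_and_concat(shards)
weights = dequantize(data, scale_inv)
```

If the two lists were gathered in different rank orders, each row would be paired with another shard's scale and the dequantized weights would be silently wrong.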