
URL: https://deepwiki.com/inclusionAI/AReaL/3.8-memory-management

Memory Management | inclusionAI/AReaL | DeepWiki


Last indexed: 7 May 2026 (2e12c1)

Memory Management

This page describes memory optimization techniques and memory management strategies in AReaL's training engines. It covers configuration parameters, optimization methods for training and inference, and memory-efficient weight synchronization mechanisms.

For configuration details, see 2.4 Training Engine Configurations. For parallelism strategies that affect memory, see 8. Parallelism and Distribution. For troubleshooting OOM errors, see 16.2 Memory and OOM Issues.

Memory Optimization Overview

AReaL provides multiple complementary memory optimization strategies across three phases of the training pipeline:

| Phase | Optimization Techniques | Primary Configuration |
| Inference/Rollout | Memory fraction tuning, Parallelism adjustment, Concurrent rollout limiting | sglang.mem_fraction_static, max_concurrent_rollouts, allocation_mode |
| Training/Gradient | Gradient checkpointing, Micro-batch sizing, CPU offloading, Per-layer optimizer step | gradient_checkpointing, mb_spec.max_tokens_per_mb, offload_params, per_layer_optim_step |
| Weight Updates | Disk-based sync, Chunked NCCL transfers | weight_update_mode, weight_chunked_mem_mb |

Sources: docs/en/best_practices/handling_oom.md1-40 docs/en/best_practices/handling_oom.md200-226

Core Memory Parameters

Critical Memory Configuration

The following parameters have the most significant impact on peak memory usage:

Micro-batch Token Limit (actor.mb_spec.max_tokens_per_mb):

Concurrent Rollout Limit (max_concurrent_rollouts):

Allocation Mode (allocation_mode):

Inference Memory Fraction (sglang.mem_fraction_static):

Sources: docs/en/best_practices/handling_oom.md8-40 docs/en/best_practices/handling_oom.md45-80 areal/api/cli_args.py99-126
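To make the effect of the micro-batch token limit concrete, here is a minimal sketch of packing variable-length sequences into micro-batches under a token cap. The function name, the greedy first-fit-decreasing strategy, and the sample lengths are illustrative assumptions; the source does not describe the splitting algorithm AReaL actually uses.

```python
def pack_into_microbatches(seq_lens, max_tokens_per_mb):
    """Greedy first-fit-decreasing packing (hypothetical helper, not AReaL's API).

    Returns a list of micro-batches, each a list of sequence indices whose
    total token count does not exceed max_tokens_per_mb.
    """
    batches = []  # each entry: [total_tokens, [sequence indices]]
    # Place long sequences first so short ones can fill the remaining space.
    order = sorted(range(len(seq_lens)), key=lambda i: seq_lens[i], reverse=True)
    for i in order:
        n = seq_lens[i]
        if n > max_tokens_per_mb:
            raise ValueError(f"sequence {i} has {n} tokens, above the cap")
        for batch in batches:
            if batch[0] + n <= max_tokens_per_mb:
                batch[0] += n
                batch[1].append(i)
                break
        else:
            # No existing micro-batch has room: open a new one.
            batches.append([n, [i]])
    return [indices for _, indices in batches]


seq_lens = [900, 300, 1200, 700, 500, 400]  # made-up sequence lengths
micro_batches = pack_into_microbatches(seq_lens, max_tokens_per_mb=2048)
```

A lower cap yields more, smaller micro-batches: peak activation memory per forward/backward pass drops, while the number of passes per step rises.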

Memory Parameter Interaction

[Diagram: Configuration to Code Entity Mapping, from natural-language config names to code entities]


Sources: docs/en/best_practices/handling_oom.md8-40 areal/api/cli_args.py99-126 areal/engine/fsdp_engine.py203

Training Memory Optimizations

Gradient Checkpointing

Gradient checkpointing reduces activation memory by recomputing activations during the backward pass instead of storing them, at the cost of extra compute (docs/en/best_practices/handling_oom.md105-106).
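The trade-off can be illustrated with a toy, framework-free sketch. It differentiates a chain of scalar "layers" twice: once storing every layer input, and once keeping only one checkpoint per segment and recomputing the rest during the backward pass. The layer function, segment size, and helper names are assumptions chosen for illustration; this is not how AReaL's engines implement checkpointing.

```python
import math


def layer(x):
    # Toy layer: y = sin(x) + x
    return math.sin(x) + x


def layer_grad(x):
    # dy/dx of the toy layer, evaluated at its input x
    return math.cos(x) + 1.0


def grad_store_all(x0, n_layers):
    """Backward pass that keeps every layer input from the forward pass."""
    acts = [x0]
    for _ in range(n_layers):
        acts.append(layer(acts[-1]))
    grad = 1.0
    for i in reversed(range(n_layers)):
        grad *= layer_grad(acts[i])
    return grad, len(acts) - 1  # number of stored layer inputs


def grad_checkpointed(x0, n_layers, segment):
    """Backward pass that stores one input per segment and recomputes the rest."""
    checkpoints = []
    x = x0
    for i in range(n_layers):
        if i % segment == 0:
            checkpoints.append(x)  # keep only the segment's first input
        x = layer(x)
    grad, peak = 1.0, 0
    for s in reversed(range(len(checkpoints))):
        start = s * segment
        end = min(start + segment, n_layers)
        acts = [checkpoints[s]]
        for _ in range(end - start - 1):
            acts.append(layer(acts[-1]))  # recompute inputs inside the segment
        # acts[0] is the checkpoint itself, so do not count it twice.
        peak = max(peak, len(checkpoints) + len(acts) - 1)
        for a in reversed(acts):
            grad *= layer_grad(a)
    return grad, peak


full_grad, full_stored = grad_store_all(0.5, 16)
ckpt_grad, ckpt_peak = grad_checkpointed(0.5, 16, segment=4)
```

Both functions return the same gradient, but the checkpointed version holds at most 7 values at once instead of 16, paying for it with one extra forward computation per segment.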

Configuration:


Implementation:

Sources: docs/en/best_practices/handling_oom.md105-110 areal/engine/fsdp_engine.py203-210 areal/experimental/engine/archon_engine.py76-78 areal/engine/megatron_engine.py29-31

CPU Offloading

CPU offloading moves parameters and optimizer states to CPU memory, streaming them to GPU only when needed.

Configuration:


Memory Flow:

[Diagram: Parameter and State Memory Flow]


Sources: docs/en/best_practices/handling_oom.md173-181 areal/engine/fsdp_engine.py203 areal/engine/fsdp_utils/optimizer.py320

Per-Layer Optimizer Step

When optimizer states are offloaded to CPU, the default CPU-based optimizer step is slow. PerLayerOptimWrapper (areal/engine/fsdp_utils/optimizer.py320) instead streams optimizer states to the GPU one layer at a time and performs the update there.
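The streaming idea can be sketched without any GPU framework. In the toy below, plain Python lists stand in for tensors, copying a list stands in for a host-to-device transfer, and SGD with momentum replaces Adam for brevity. All names are hypothetical and this is not the PerLayerOptimWrapper implementation; it only shows that one layer's state needs to be resident at a time.

```python
def per_layer_step(params, grads, cpu_states, lr=0.1, beta=0.9):
    """Update parameters layer by layer, streaming each layer's state in and out.

    Returns the largest number of optimizer-state elements that were
    "on the GPU" at any one time.
    """
    peak_gpu_state_elems = 0
    for name, weights in params.items():
        gpu_state = list(cpu_states[name])  # stand-in for a CPU -> GPU copy
        peak_gpu_state_elems = max(peak_gpu_state_elems, len(gpu_state))
        for j, g in enumerate(grads[name]):
            gpu_state[j] = beta * gpu_state[j] + g  # momentum update
            weights[j] -= lr * gpu_state[j]         # parameter update
        cpu_states[name] = gpu_state  # stand-in for a GPU -> CPU copy
    return peak_gpu_state_elems


params = {"layer0": [1.0, 2.0], "layer1": [3.0, 4.0, 5.0]}
grads = {"layer0": [1.0, 1.0], "layer1": [1.0, 1.0, 1.0]}
cpu_states = {name: [0.0] * len(w) for name, w in params.items()}

peak = per_layer_step(params, grads, cpu_states)
total_state_elems = sum(len(s) for s in cpu_states.values())
```

Here the peak resident state is the size of the largest layer (3 elements) rather than the full optimizer state (5 elements), which is the memory benefit the per-layer approach aims for.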

Configuration:


Key Implementation Details:

Sources: docs/en/best_practices/handling_oom.md167-181 areal/engine/fsdp_utils/optimizer.py320-632

Memory-Efficient Model Loading

Standard model loading places the full weights on every GPU before sharding them. Memory-efficient loading avoids this by loading the weights on rank 0 only and broadcasting them to the other ranks.
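As a rough illustration of why a rank-0 broadcast lowers per-rank memory, the following pure-Python simulation walks a checkpoint one tensor at a time and lets each rank keep only its slice. Lists stand in for tensors, and the tensor-by-tensor broadcast followed by even slicing is an assumption about the general approach, not a description of what fsdp2_load_full_state_dict() does internally.

```python
def load_with_rank0_broadcast(full_state, world_size):
    """Simulate loading where only rank 0 reads the full checkpoint.

    Returns the per-rank shards and the size of the largest temporary
    buffer a rank needs while receiving a broadcast tensor.
    """
    shards = [{} for _ in range(world_size)]
    peak_buffer_elems = 0
    for name, tensor in full_state.items():  # rank 0 iterates the checkpoint
        # Only one full tensor is in flight at a time.
        peak_buffer_elems = max(peak_buffer_elems, len(tensor))
        per_rank = -(-len(tensor) // world_size)  # ceiling division
        for rank in range(world_size):
            # Each rank keeps its slice and discards the rest of the buffer.
            shards[rank][name] = tensor[rank * per_rank:(rank + 1) * per_rank]
    return shards, peak_buffer_elems


full_state = {"w1": list(range(8)), "w2": list(range(4))}  # toy checkpoint
shards, peak_buffer = load_with_rank0_broadcast(full_state, world_size=4)
resident_per_rank = sum(len(t) for t in shards[1].values())
```

In this toy setup each rank ends up holding 3 elements plus a temporary buffer of at most 8, instead of all 12 elements of the checkpoint.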

Configuration:


Implementation: The function fsdp2_load_full_state_dict() in areal/engine/fsdp_utils/__init__.py implements this:

Sources: docs/en/best_practices/handling_oom.md203-223 areal/engine/fsdp_utils/__init__.py110-161

Precision Reduction for Optimizer States

AnyPrecisionAdamW (areal/engine/fsdp_utils/optimizer.py44) allows configuring the precision of optimizer momentum and variance states.
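A quick back-of-the-envelope calculation shows what state precision is worth. The sketch assumes an Adam-style optimizer that keeps two state tensors per parameter (momentum and variance) and a hypothetical 7-billion-parameter model; the helper name is illustrative, and the totals are for the whole model before any sharding across ranks.

```python
BYTES_PER_ELEMENT = {"float32": 4, "bfloat16": 2}


def adam_state_bytes(num_params, state_dtype):
    """Bytes needed for momentum plus variance at the given precision."""
    # Two state tensors per parameter: momentum and variance.
    return 2 * num_params * BYTES_PER_ELEMENT[state_dtype]


num_params = 7_000_000_000  # hypothetical model size
fp32_gib = adam_state_bytes(num_params, "float32") / 2**30
bf16_gib = adam_state_bytes(num_params, "bfloat16") / 2**30

print(f"float32 states:  {fp32_gib:.1f} GiB")
print(f"bfloat16 states: {bf16_gib:.1f} GiB")
```

Halving the state precision halves this footprint; whether the reduced precision is acceptable for a given training run is a separate question that the arithmetic does not answer.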

Configuration:


Implementation:

Sources: docs/en/best_practices/handling_oom.md189-201 areal/engine/fsdp_utils/optimizer.py44-189

Weight Synchronization Memory

Weight updates between training and inference engines can cause memory spikes during the transfer phase.

Disk-Based Weight Updates

Configuration:


Mechanism:

NCCL-Based Weight Updates with Chunking

If using NCCL (the default), memory can be managed by reducing the transfer buffer size with weight_chunked_mem_mb.

Sources: docs/en/best_practices/handling_oom.md200-226 areal/experimental/engine/archon_weight_sync.py62-67
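The chunking idea behind weight_chunked_mem_mb can be sketched as grouping parameters into transfer buckets that stay under a byte budget. The helper below is hypothetical: the parameter names and sizes are made up, and treating the budget as MiB (1024 * 1024 bytes) is an assumption, since the source does not state how AReaL interprets the value or groups tensors.

```python
def chunk_by_budget(param_sizes, chunk_mb):
    """Yield lists of parameter names whose combined size fits the budget.

    param_sizes: iterable of (name, size_in_bytes) pairs in transfer order.
    A single tensor larger than the budget is sent in a chunk of its own.
    """
    budget = chunk_mb * 1024 * 1024
    chunk, used = [], 0
    for name, nbytes in param_sizes:
        if chunk and used + nbytes > budget:
            yield chunk  # flush before the budget would be exceeded
            chunk, used = [], 0
        chunk.append(name)
        used += nbytes
    if chunk:
        yield chunk


MIB = 1024 * 1024
param_sizes = [
    ("embed", 300 * MIB),
    ("layer0", 200 * MIB),
    ("layer1", 200 * MIB),
    ("head", 300 * MIB),
]
chunks = list(chunk_by_budget(param_sizes, chunk_mb=512))
```

A smaller budget produces more, smaller chunks, which lowers the size of the temporary transfer buffer at the cost of more communication rounds.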

Decision Tree for Memory Optimization

[Diagram: OOM Resolution Decision Logic]


Sources: docs/en/best_practices/handling_oom.md1-226