
URL: https://deepwiki.com/inclusionAI/AReaL/16.2-memory-and-oom-issues

Last indexed: 7 May 2026 (2e12c1)

Memory and OOM Issues

This page is a technical guide to diagnosing and resolving out-of-memory (OOM) errors in AReaL. It covers memory management strategies across generation, training, and weight synchronization, focusing on the implementation details of the FSDPEngine, ArchonEngine, and MegatronEngine backends.

Memory Usage Overview

AReaL's memory footprint is dynamic, peaking at different stages of the RL loop. Understanding the data flow between components is critical for pinpointing which process (Inference or Training) is exceeding GPU limits.

[Diagram: System Memory Architecture and Code Entities]


Sources: areal/api/cli_args.py:18, docs/en/best_practices/handling_oom.md:6-39

Core Memory Parameters

| Parameter | Code Entity | Impact | Source |
| --- | --- | --- | --- |
| Micro-batch Tokens | MicroBatchSpec.max_tokens_per_mb | Primary control for training activation memory. | areal/api/cli_args.py:18 |
| Concurrent Rollouts | max_concurrent_rollouts | Controls inference server KV cache pressure. | docs/en/best_practices/handling_oom.md:45-54 |
| Max Length | train_dataset.max_length | Defines the minimum possible memory footprint per sequence. | docs/en/best_practices/handling_oom.md:15-17 |
| Offload Policy | CPUOffloadPolicy | Determines if parameters/optimizer states stay on CPU. | areal/engine/fsdp_utils/__init__.py:10-13 |

Sources: areal/api/cli_args.py:18, docs/en/best_practices/handling_oom.md:6-39, areal/engine/fsdp_utils/__init__.py:10-13
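To illustrate why a per-micro-batch token budget bounds activation memory, here is a minimal sketch that groups sequences so that no group exceeds the budget. It assumes a greedy first-fit packing over sequences sorted by length; the function name `split_into_micro_batches` is hypothetical and this is not AReaL's actual splitting logic, only the parameter name `max_tokens_per_mb` comes from the table above.

```python
from typing import List


def split_into_micro_batches(seq_lens: List[int], max_tokens_per_mb: int) -> List[List[int]]:
    """Group sequence indices so each group holds at most max_tokens_per_mb tokens.

    Hypothetical helper for illustration; not AReaL's implementation.
    """
    if max_tokens_per_mb <= 0:
        raise ValueError("max_tokens_per_mb must be positive")
    for n in seq_lens:
        if n > max_tokens_per_mb:
            raise ValueError(f"a sequence of {n} tokens exceeds the budget")

    # Place the longest sequences first, each into the first group with room.
    order = sorted(range(len(seq_lens)), key=lambda i: seq_lens[i], reverse=True)
    batches: List[List[int]] = []
    loads: List[int] = []
    for idx in order:
        for b, load in enumerate(loads):
            if load + seq_lens[idx] <= max_tokens_per_mb:
                batches[b].append(idx)
                loads[b] += seq_lens[idx]
                break
        else:
            # No existing group has room, so open a new one.
            batches.append([idx])
            loads.append(seq_lens[idx])
    return batches


seq_lens = [900, 300, 700, 100, 500]
micro_batches = split_into_micro_batches(seq_lens, max_tokens_per_mb=1024)
token_counts = [sum(seq_lens[i] for i in mb) for mb in micro_batches]
print(micro_batches)  # [[0, 3], [2, 1], [4]]
print(token_counts)   # [1000, 1000, 500]
```

Lowering the budget in this sketch produces more, smaller groups, which is the trade-off the parameter exposes: less activation memory per forward/backward pass in exchange for more passes.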

Training Memory Optimizations

1. Per-Layer Optimizer Step

Standard FSDP with offload_params: true performs optimizer updates on the CPU, which is slow. AReaL implements PerLayerOptimWrapper to stream optimizer states (momentum/variance) to the GPU one layer at a time. This is compatible with both offload_params: true and false.

[Diagram: Data Flow for Per-Layer Updates]


Sources: areal/engine/fsdp_utils/optimizer.py:44-101, docs/en/best_practices/handling_oom.md:167-181, tests/test_per_layer_optim_step.py:124-145
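The idea of updating one layer at a time can be sketched in plain Python. This is a simulation under stated assumptions, not PerLayerOptimWrapper itself: the `device` tag stands in for a real host-to-device copy, the class `LayerState` and function `per_layer_step` are hypothetical, and the update is a simplified Adam-style rule without bias correction. It shows that the optimizer state resident on the accelerator at any moment is only one layer's worth.

```python
from typing import Dict, List


class LayerState:
    """Adam-style moments for one layer, tagged with the device holding them."""

    def __init__(self, size: int) -> None:
        self.exp_avg = [0.0] * size     # first moment (momentum)
        self.exp_avg_sq = [0.0] * size  # second moment (variance)
        self.device = "cpu"

    def numel(self) -> int:
        return len(self.exp_avg) + len(self.exp_avg_sq)


def per_layer_step(
    params: Dict[str, List[float]],
    grads: Dict[str, List[float]],
    states: Dict[str, LayerState],
    lr: float = 0.1,
    beta1: float = 0.9,
    beta2: float = 0.999,
    eps: float = 1e-8,
) -> int:
    """Update layers one by one; return the peak state elements resident on 'gpu'."""
    peak = 0
    for name in params:
        state = states[name]
        state.device = "gpu"  # simulated copy of this layer's state to the GPU
        resident = sum(s.numel() for s in states.values() if s.device == "gpu")
        peak = max(peak, resident)
        for i, g in enumerate(grads[name]):
            state.exp_avg[i] = beta1 * state.exp_avg[i] + (1 - beta1) * g
            state.exp_avg_sq[i] = beta2 * state.exp_avg_sq[i] + (1 - beta2) * g * g
            params[name][i] -= lr * state.exp_avg[i] / (state.exp_avg_sq[i] ** 0.5 + eps)
        state.device = "cpu"  # simulated copy back before touching the next layer
    return peak


params = {"layer0": [1.0] * 4, "layer1": [1.0] * 6}
grads = {"layer0": [0.5] * 4, "layer1": [0.5] * 6}
states = {name: LayerState(len(p)) for name, p in params.items()}
peak = per_layer_step(params, grads, states)
total = sum(s.numel() for s in states.values())
print(peak, total)  # 12 20
```

In this toy run the peak resident state (12 elements, the larger layer) is smaller than the total state (20 elements), which is the memory saving that per-layer streaming aims for.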

2. Memory-Efficient Model Loading

Large models often OOM during from_pretrained if every rank attempts to load the full weights. AReaL uses a tiered initialization strategy:

  1. Meta-device Init: Model structure is created without allocating weight memory.
  2. FSDP Wrapping: apply_fsdp2 shards the meta-tensors across the DeviceMesh (areal/engine/fsdp_utils/__init__.py:62-108).
  3. Rank-0 Broadcast: Only Rank 0 loads the weights from disk and broadcasts them via NCCL using fsdp2_load_full_state_dict (areal/engine/fsdp_utils/__init__.py:110-141).

Sources: areal/engine/fsdp_utils/__init__.py:62-141, docs/en/best_practices/handling_oom.md:203-223
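A rough calculation shows why loading the full weights on every rank is costly compared with loading them once on Rank 0. The model size (7 billion parameters), the 2-byte parameter width, and the 8 ranks per node below are illustrative assumptions, and the helper `full_copy_bytes` is hypothetical; the sketch counts only full-weight copies and ignores the sharded parameters each rank keeps afterwards.

```python
def full_copy_bytes(num_params: int, bytes_per_param: int, ranks_on_node: int, rank0_only: bool) -> int:
    """Bytes of full-weight copies materialized on one node while loading."""
    one_copy = num_params * bytes_per_param
    copies = 1 if rank0_only else ranks_on_node
    return copies * one_copy


# Illustrative numbers only: 7e9 parameters stored in 2 bytes each, 8 ranks per node.
naive = full_copy_bytes(7_000_000_000, 2, 8, rank0_only=False)
rank0 = full_copy_bytes(7_000_000_000, 2, 8, rank0_only=True)
print(naive / 1e9, rank0 / 1e9)  # 112.0 14.0 (GB)
```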

3. MoE Expert Sharding

For Mixture-of-Experts (MoE) models, AReaL's ArchonEngine uses specialized converters to prevent memory spikes. MoEWeightConverter in Archon calculates local_experts_indices based on DTensor placements to perform sharded loads instead of full expert concatenation.

Sources: areal/experimental/models/archon/moe_weight_converter.py:46-61, areal/experimental/models/archon/moe_weight_converter.py:124-146
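The difference between a sharded load and a full expert concatenation can be sketched as follows. This is a simplified illustration, assuming experts are split into contiguous, equally sized blocks across expert-parallel ranks; the real MoEWeightConverter derives its indices from DTensor placements, and the helpers `local_expert_indices` and `load_local_experts` as well as the checkpoint key layout are hypothetical.

```python
from typing import Dict, List


def local_expert_indices(num_experts: int, ep_size: int, ep_rank: int) -> List[int]:
    """Expert ids owned by one rank, assuming contiguous equal-sized blocks."""
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    if not 0 <= ep_rank < ep_size:
        raise ValueError("ep_rank out of range")
    per_rank = num_experts // ep_size
    start = ep_rank * per_rank
    return list(range(start, start + per_rank))


def load_local_experts(checkpoint: Dict[str, List[float]], indices: List[int]) -> Dict[int, List[float]]:
    """Read only the listed experts instead of materializing all of them."""
    return {i: checkpoint[f"experts.{i}.weight"] for i in indices}


# Toy checkpoint with 8 experts; the key format is made up for this example.
checkpoint = {f"experts.{i}.weight": [float(i)] * 2 for i in range(8)}
indices = local_expert_indices(num_experts=8, ep_size=4, ep_rank=1)
local = load_local_experts(checkpoint, indices)
print(indices)        # [2, 3]
print(sorted(local))  # [2, 3]
```

Each rank in this sketch touches 2 of the 8 expert tensors, so its peak memory during loading scales with its local share rather than with the full expert set.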

Diagnosing and Resolving OOM

Generation (Inference) OOM

If the inference backend (SGLang/vLLM) crashes with OOM, lower max_concurrent_rollouts first, since it controls KV cache pressure on the inference servers.

Training OOM

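If the training process crashes during forward/backward, lower MicroBatchSpec.max_tokens_per_mb first, since it is the primary control for training activation memory.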

Weight Update OOM

This type of OOM occurs during the synchronization of weights from the trainer to the inference servers.

Implementation Reference

Sequence Packing and Padding

Memory usage is tightly coupled with how sequences are handled in the data utilities (areal/utils/data.py). The pad_sequences_to_tensors function creates the attention_mask and pads the sequences, so that tensors are ready for sharded computation without exceeding the max_length defined in the configuration.

Sources: areal/utils/data.py:105-146, docs/en/best_practices/handling_oom.md:87-103
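The relationship between padding, the attention mask, and the length limit can be sketched with plain lists. This is not the pad_sequences_to_tensors implementation: the helper `pad_sequences` is hypothetical, it pads to the longest sequence in the batch, and it raises an error when a sequence exceeds `max_length`, both of which are assumptions made for illustration.

```python
from typing import List, Tuple


def pad_sequences(
    seqs: List[List[int]], max_length: int, pad_id: int = 0
) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad token lists to the batch's longest length and build a 0/1 mask."""
    if not seqs:
        raise ValueError("need at least one sequence")
    longest = max(len(s) for s in seqs)
    if longest > max_length:
        raise ValueError(f"sequence length {longest} exceeds max_length {max_length}")
    input_ids = [s + [pad_id] * (longest - len(s)) for s in seqs]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(s) + [0] * (longest - len(s)) for s in seqs]
    return input_ids, attention_mask


input_ids, attention_mask = pad_sequences([[5, 6, 7], [8]], max_length=4)
print(input_ids)       # [[5, 6, 7], [8, 0, 0]]
print(attention_mask)  # [[1, 1, 1], [1, 0, 0]]
```

Padding tokens still occupy memory, so a batch that mixes very short and very long sequences pays for the longest one in every row; that is the coupling between sequence handling and memory described above.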

Parallelism Constraint Validation

When scaling parallelism to save memory, the following constraints must be respected:

| Parallelism Type | Requirement | Source |
| --- | --- | --- |
| Ulysses (CP) | n_heads % cp_size == 0 | docs/en/best_practices/handling_oom.md:126-132 |
| Tensor (TP) | n_heads % tp_size == 0 | docs/en/best_practices/handling_oom.md:126-132 |
| Expert (EP) | num_experts % (strided_shard_degree * shard_degree) == 0 | areal/experimental/models/archon/moe_weight_converter.py:105-115 |

Sources: docs/en/best_practices/handling_oom.md:120-132, areal/experimental/models/archon/moe_weight_converter.py:103-120
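The divisibility requirements in the table can be checked before launching a job. The sketch below is a hypothetical pre-flight helper (`check_parallelism` is not an AReaL function); it encodes only the three conditions listed above and uses the same symbol names.

```python
from typing import List


def check_parallelism(
    n_heads: int,
    num_experts: int = 0,
    cp_size: int = 1,
    tp_size: int = 1,
    strided_shard_degree: int = 1,
    shard_degree: int = 1,
) -> List[str]:
    """Return the violated constraints; an empty list means the layout is valid."""
    problems: List[str] = []
    if n_heads % cp_size != 0:
        problems.append(f"n_heads={n_heads} is not divisible by cp_size={cp_size}")
    if n_heads % tp_size != 0:
        problems.append(f"n_heads={n_heads} is not divisible by tp_size={tp_size}")
    ep_ways = strided_shard_degree * shard_degree
    # num_experts == 0 means a dense model, so the expert check is skipped.
    if num_experts and num_experts % ep_ways != 0:
        problems.append(f"num_experts={num_experts} is not divisible by {ep_ways}")
    return problems


ok = check_parallelism(
    n_heads=32, num_experts=64, cp_size=4, tp_size=8,
    strided_shard_degree=2, shard_degree=4,
)
bad = check_parallelism(n_heads=28, tp_size=8)
print(ok)   # []
print(bad)  # ['n_heads=28 is not divisible by tp_size=8']
```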

Checkpoint and Recovery Memory

During checkpointing, memory pressure can spike. AReaL provides AsyncCheckpointManager (primarily for ArchonEngine) to stage checkpoints to CPU/disk asynchronously, minimizing training pauses and memory spikes. RecoverHandler manages the restoration of state, including dataloader_info, which is all-gathered across ranks to ensure continuity after an OOM-induced crash.

Sources: areal/utils/saver.py:17-34, areal/utils/recover.py:41-94
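The general pattern behind asynchronous checkpointing, taking a snapshot quickly and writing it in the background, can be sketched with the standard library. This is not AsyncCheckpointManager: the function `save_async` is hypothetical, `copy.deepcopy` stands in for staging tensors to CPU memory, and JSON stands in for a real checkpoint format.

```python
import copy
import json
import os
import tempfile
import threading


def save_async(state: dict, path: str) -> threading.Thread:
    """Snapshot state synchronously, then write it to disk on a background thread."""
    snapshot = copy.deepcopy(state)  # stand-in for staging tensors to CPU

    def _write() -> None:
        tmp_path = path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(snapshot, f)
        os.replace(tmp_path, path)  # rename so readers never see a partial file

    writer = threading.Thread(target=_write)
    writer.start()
    return writer


state = {"step": 10, "weights": [0.1, 0.2]}
ckpt_path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
writer = save_async(state, ckpt_path)
state["step"] = 11  # training continues while the write is in flight
writer.join()
with open(ckpt_path) as f:
    restored = json.load(f)
print(restored["step"])  # 10
```

Because the snapshot is taken before the thread starts, later changes to `state` do not leak into the checkpoint, and the training loop is blocked only for the duration of the copy.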


Page Sources: areal/api/cli_args.py areal/engine/fsdp_utils/__init__.py areal/engine/fsdp_utils/optimizer.py areal/experimental/models/archon/moe_weight_converter.py areal/utils/data.py areal/utils/saver.py areal/utils/recover.py docs/en/best_practices/handling_oom.md docs/zh/best_practices/handling_oom.md tests/test_per_layer_optim_step.py