Last indexed: 7 May 2026 (2e12c1)

Memory and OOM Issues

This page explains how to diagnose and resolve out-of-memory (OOM) errors in AReaL. It covers memory management across generation, training, and weight synchronization, with a focus on the implementation details of the FSDPEngine, ArchonEngine, and MegatronEngine backends.

Memory Usage Overview

AReaL's memory footprint is dynamic, peaking at different stages of the RL loop. Understanding the data flow between components is critical for pinpointing which process (Inference or Training) is exceeding GPU limits.

[Diagram: System Memory Architecture and Code Entities]

Sources: areal/api/cli_args.py18 docs/en/best_practices/handling_oom.md6-39

Core Memory Parameters

Parameter	Code Entity	Impact
Micro-batch Tokens	`MicroBatchSpec.max_tokens_per_mb`	Primary control for training activation memory. areal/api/cli_args.py18
Concurrent Rollouts	`max_concurrent_rollouts`	Controls inference server KV cache pressure. docs/en/best_practices/handling_oom.md45-54
Max Length	`train_dataset.max_length`	Caps sequence length, and so bounds the worst-case memory footprint of a single sequence. docs/en/best_practices/handling_oom.md15-17
Offload Policy	`CPUOffloadPolicy`	Determines if parameters/optimizer states stay on CPU. areal/engine/fsdp_utils/__init__.py10-13

Sources: areal/api/cli_args.py18 docs/en/best_practices/handling_oom.md6-39 areal/engine/fsdp_utils/__init__.py10-13
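
To illustrate why the micro-batch token budget is the main lever for training activation memory, the sketch below greedily groups sequences so that no micro-batch exceeds a token cap. The function name `split_into_micro_batches` and the greedy in-order policy are assumptions made for illustration; they are not taken from AReaL's code.

```python
from typing import List


def split_into_micro_batches(seq_lens: List[int], max_tokens_per_mb: int) -> List[List[int]]:
    """Group sequence indices in order so each group holds at most max_tokens_per_mb tokens."""
    batches: List[List[int]] = []
    current: List[int] = []
    current_tokens = 0
    for idx, n_tokens in enumerate(seq_lens):
        if n_tokens > max_tokens_per_mb:
            raise ValueError(
                f"sequence {idx} has {n_tokens} tokens, above the budget of {max_tokens_per_mb}"
            )
        # Close the current micro-batch when adding this sequence would exceed the budget.
        if current and current_tokens + n_tokens > max_tokens_per_mb:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(idx)
        current_tokens += n_tokens
    if current:
        batches.append(current)
    return batches


seq_lens = [900, 300, 700, 200, 1000]
large_budget = split_into_micro_batches(seq_lens, 2048)
small_budget = split_into_micro_batches(seq_lens, 1024)
print(large_budget)  # [[0, 1, 2], [3, 4]]
print(small_budget)  # [[0], [1, 2], [3], [4]]
```

Halving the budget roughly halves the tokens held in activations at any one time, at the cost of more forward/backward passes per step.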

Training Memory Optimizations

1. Per-Layer Optimizer Step

Standard FSDP with offload_params: true performs optimizer updates on the CPU, which is slow. AReaL implements PerLayerOptimWrapper, which streams optimizer states (momentum/variance) to the GPU one layer at a time so that the update runs on the GPU without all states being resident at once. The wrapper works with offload_params set to either true or false.

[Diagram: Data Flow for Per-Layer Updates]

Sources: areal/engine/fsdp_utils/optimizer.py44-101 docs/en/best_practices/handling_oom.md167-181 tests/test_per_layer_optim_step.py124-145
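
The idea of streaming optimizer states one layer at a time can be sketched in plain Python. This is a toy model under stated assumptions, not the PerLayerOptimWrapper implementation: the class name `PerLayerAdamSketch` is hypothetical, "host" and "staged" stand in for CPU and GPU memory, and the update is textbook Adam with bias correction.

```python
from typing import Dict, List


class PerLayerAdamSketch:
    """Toy Adam that keeps states in a host-side store and stages one layer at a time."""

    def __init__(self, layers: Dict[str, List[float]], lr: float = 1e-3,
                 betas=(0.9, 0.999), eps: float = 1e-8):
        self.layers = layers
        self.lr, self.betas, self.eps = lr, betas, eps
        self.step_count = 0
        # Host-side ("CPU") momentum and variance, one entry per layer.
        self.host_state = {
            name: {"m": [0.0] * len(p), "v": [0.0] * len(p)} for name, p in layers.items()
        }
        self.peak_staged_elems = 0  # largest number of state elements staged at once

    def step(self, grads: Dict[str, List[float]]) -> None:
        self.step_count += 1
        b1, b2 = self.betas
        bc1 = 1 - b1 ** self.step_count
        bc2 = 1 - b2 ** self.step_count
        for name, params in self.layers.items():
            # "Stage in": only this layer's states leave the host store.
            staged = self.host_state.pop(name)
            m, v = staged["m"], staged["v"]
            self.peak_staged_elems = max(self.peak_staged_elems, len(m) + len(v))
            for i, g in enumerate(grads[name]):
                m[i] = b1 * m[i] + (1 - b1) * g
                v[i] = b2 * v[i] + (1 - b2) * g * g
                params[i] -= self.lr * (m[i] / bc1) / ((v[i] / bc2) ** 0.5 + self.eps)
            # "Stage out": return the updated states before touching the next layer.
            self.host_state[name] = staged


layers = {"layer0": [1.0, 2.0], "layer1": [3.0, 4.0, 5.0]}
grads = {name: [1.0] * len(p) for name, p in layers.items()}
opt = PerLayerAdamSketch(layers)
opt.step(grads)
total_state_elems = sum(len(s["m"]) + len(s["v"]) for s in opt.host_state.values())
print(opt.peak_staged_elems, total_state_elems)  # 6 10
```

The peak number of staged state elements equals the largest single layer (6 here), not the whole model (10), which is the memory saving the per-layer approach is after.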

2. Memory-Efficient Model Loading

Large models often run out of memory during from_pretrained when every rank attempts to load the full weights. AReaL avoids this with a three-step initialization strategy:

Meta-device Init: Model structure is created without allocating weight memory.
FSDP Wrapping: apply_fsdp2 shards the meta-tensors across the DeviceMesh. areal/engine/fsdp_utils/__init__.py62-108
Rank-0 Broadcast: Only Rank 0 loads the weights from disk and broadcasts them via NCCL using fsdp2_load_full_state_dict. areal/engine/fsdp_utils/__init__.py110-141

Sources: areal/engine/fsdp_utils/__init__.py62-141 docs/en/best_practices/handling_oom.md203-223
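
The three steps above can be sketched as a small simulation: one "rank 0" holds the full state dict, and every rank ends up holding only its own row-wise shard. The helper names (`shard_rows`, `rank0_load_and_scatter`, `numel`) are hypothetical, nested lists stand in for tensors, and the contiguous dim-0 split is an assumption about the sharding layout rather than a description of fsdp2_load_full_state_dict.

```python
from typing import Dict, List

Matrix = List[List[float]]


def shard_rows(rows: Matrix, world_size: int, rank: int) -> Matrix:
    """Return the contiguous block of rows owned by `rank` (ceil-divided along dim 0)."""
    per_rank = -(-len(rows) // world_size)  # ceil division
    return rows[rank * per_rank:(rank + 1) * per_rank]


def rank0_load_and_scatter(full_state: Dict[str, Matrix], world_size: int) -> List[Dict[str, Matrix]]:
    """Simulate rank 0 holding full_state and sending each rank only its shard."""
    return [
        {name: shard_rows(rows, world_size, rank) for name, rows in full_state.items()}
        for rank in range(world_size)
    ]


def numel(state: Dict[str, Matrix]) -> int:
    """Count scalar elements held in a (possibly sharded) state dict."""
    return sum(len(row) for rows in state.values() for row in rows)


full_state = {"w": [[float(i)] * 2 for i in range(8)]}  # an 8x2 "weight"
shards = rank0_load_and_scatter(full_state, world_size=4)
print(numel(full_state), [numel(s) for s in shards])  # 16 [4, 4, 4, 4]
```

Only the loading rank ever materializes all 16 elements; each of the four ranks keeps 4, so peak memory on the other ranks scales with 1/world_size.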

3. MoE Expert Sharding

For Mixture-of-Experts (MoE) models, AReaL's ArchonEngine uses specialized converters to prevent memory spikes during weight loading. MoEWeightConverter calculates local_experts_indices from the DTensor placements, so each rank loads only its own shard of experts instead of concatenating the full expert set.

Sources: areal/experimental/models/archon/moe_weight_converter.py46-61 areal/experimental/models/archon/moe_weight_converter.py124-146
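
A minimal sketch of the index calculation, assuming experts are assigned to ranks in contiguous, equally sized blocks. The real converter derives the indices from DTensor placements; the function `local_expert_indices` and its `ep_size`/`ep_rank` parameters are hypothetical stand-ins for that logic.

```python
from typing import List


def local_expert_indices(num_experts: int, ep_size: int, ep_rank: int) -> List[int]:
    """Indices of the experts owned by ep_rank under a contiguous block layout (assumed)."""
    if num_experts % ep_size != 0:
        raise ValueError(f"num_experts={num_experts} is not divisible by ep_size={ep_size}")
    if not 0 <= ep_rank < ep_size:
        raise ValueError(f"ep_rank={ep_rank} is outside [0, {ep_size})")
    per_rank = num_experts // ep_size
    start = ep_rank * per_rank
    return list(range(start, start + per_rank))


# With 8 experts over 4 ranks, rank 1 loads only experts 2 and 3.
print(local_expert_indices(8, 4, 1))  # [2, 3]
```

Loading by index means a rank reads two expert weight sets from disk in this example rather than all eight.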

Diagnosing and Resolving OOM

Generation (Inference) OOM

If the inference backend (SGLang/vLLM) crashes with OOM:

Action: Reduce max_concurrent_rollouts. docs/en/best_practices/handling_oom.md45-54
Action: Increase Tensor Parallelism (TP) in rollout.backend (e.g., sglang:d2t2). docs/en/best_practices/handling_oom.md56-70
Action: Adjust sglang.mem_fraction_static (e.g., set to 0.8). docs/en/best_practices/handling_oom.md71-80
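
To see how these three knobs interact, the following back-of-the-envelope estimate computes how many full-length rollouts fit in the KV cache. It assumes the static memory fraction covers model weights plus the KV pool and ignores all other overhead; the model shape (32 layers, 8 KV heads, head dimension 128, bf16, 16 GiB of weights) and the function names are hypothetical.

```python
GIB = 2**30


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2, tp_size: int = 1) -> int:
    """Per-GPU KV-cache bytes for one token: K and V per layer, KV heads split across TP ranks."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes // tp_size


def max_full_length_rollouts(gpu_mem_gib: float, mem_fraction_static: float,
                             weight_gib_per_gpu: float, tokens_per_rollout: int,
                             bytes_per_token: int) -> int:
    """Rollouts of tokens_per_rollout tokens that fit in the KV pool left after weights."""
    kv_pool_bytes = int((gpu_mem_gib * mem_fraction_static - weight_gib_per_gpu) * GIB)
    return max(kv_pool_bytes, 0) // bytes_per_token // tokens_per_rollout


# 80 GiB GPU, static fraction 0.8, 16384 tokens per rollout.
tp1 = max_full_length_rollouts(80, 0.8, 16, 16384, kv_bytes_per_token(32, 8, 128))
tp2 = max_full_length_rollouts(80, 0.8, 8, 16384, kv_bytes_per_token(32, 8, 128, tp_size=2))
print(tp1, tp2)  # 24 56
```

Under these assumptions, doubling TP both halves the per-GPU weight footprint and halves the per-token KV cost, which is why it raises the number of rollouts a server can hold.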

Training OOM

If the training process crashes during forward/backward:

Action: Decrease actor.mb_spec.max_tokens_per_mb. This is the primary parameter for controlling training memory. docs/en/best_practices/handling_oom.md87-95
Action: Enable gradient_checkpointing: true. docs/en/best_practices/handling_oom.md105-110
Action: Use 5D Parallelism. For long context, apply Ulysses sequence parallelism (fsdp:d2c2) or Pipeline Parallelism (archon:d2p2). docs/en/best_practices/handling_oom.md112-157
Action: Switch to a low-precision optimizer. AnyPrecisionAdamW can be configured with momentum_dtype="bfloat16" and variance_dtype="bfloat16". areal/engine/fsdp_utils/optimizer.py44-56 docs/en/best_practices/handling_oom.md189-203
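
The saving from a low-precision optimizer is simple arithmetic: Adam-style optimizers keep one momentum and one variance value per parameter. The sketch below works that out for a hypothetical 7-billion-parameter model; `adam_state_bytes` and `shard_degree` are illustrative names, not AReaL APIs.

```python
DTYPE_BYTES = {"float32": 4, "bfloat16": 2}


def adam_state_bytes(n_params: int, momentum_dtype: str = "float32",
                     variance_dtype: str = "float32", shard_degree: int = 1) -> int:
    """Bytes of Adam momentum + variance held per rank when states are sharded evenly."""
    per_param = DTYPE_BYTES[momentum_dtype] + DTYPE_BYTES[variance_dtype]
    return n_params * per_param // shard_degree


n_params = 7_000_000_000  # hypothetical model size
fp32_total = adam_state_bytes(n_params)
bf16_total = adam_state_bytes(n_params, "bfloat16", "bfloat16")
bf16_per_rank = adam_state_bytes(n_params, "bfloat16", "bfloat16", shard_degree=8)
print(fp32_total / 1e9, bf16_total / 1e9, bf16_per_rank / 1e9)  # 56.0 28.0 3.5
```

Storing both states in bfloat16 halves optimizer-state memory relative to float32, independent of how the states are sharded.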

Weight Update OOM

This OOM occurs while weights are being synchronized from the Trainer to the Inference servers.

Action: Change actor.weight_update_mode to disk. This avoids the NCCL buffer overhead. docs/zh/best_practices/handling_oom.md204-214
Action: If using NCCL, reduce the weight_chunked_mem_mb setting in WeightUpdateMeta to decrease the size of the communication buffer. docs/zh/best_practices/handling_oom.md215-226
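
The effect of a smaller chunk budget can be sketched as follows. This assumes the setting acts as a per-chunk byte cap on parameters that are sent together; the grouping policy and the name `chunk_params_by_budget` are hypothetical, and the parameter sizes are invented for the example.

```python
from typing import Dict, List

MIB = 1024 * 1024


def chunk_params_by_budget(param_bytes: Dict[str, int], chunk_mb: int) -> List[List[str]]:
    """Group parameter names in order so each group stays within chunk_mb MiB.

    A single parameter larger than the budget is placed in a chunk of its own.
    """
    budget = chunk_mb * MIB
    chunks: List[List[str]] = []
    current: List[str] = []
    used = 0
    for name, nbytes in param_bytes.items():
        if current and used + nbytes > budget:
            chunks.append(current)
            current, used = [], 0
        current.append(name)
        used += nbytes
    if current:
        chunks.append(current)
    return chunks


sizes = {"embed": 600 * MIB, "l0": 300 * MIB, "l1": 300 * MIB, "l2": 300 * MIB, "head": 600 * MIB}
print(chunk_params_by_budget(sizes, 1024))  # [['embed', 'l0'], ['l1', 'l2'], ['head']]
print(len(chunk_params_by_budget(sizes, 512)))  # 5
```

A smaller budget means more, smaller transfers: the staging buffer shrinks, while the number of communication rounds grows.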

Implementation Reference

Sequence Packing and Padding

Memory usage is tightly coupled to how sequences are padded and packed by the data utilities in areal/utils/data.py. The pad_sequences_to_tensors function builds the attention_mask and pads sequences so that the resulting tensors are ready for sharded computation and do not exceed the max_length defined in the configuration.

Sources: areal/utils/data.py105-146 docs/en/best_practices/handling_oom.md87-103
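
The padding behavior described above can be illustrated with plain lists. This is a simplified sketch and does not reproduce the signature or return type of pad_sequences_to_tensors; the function `pad_sequences`, the `pad_id` default, and the error on over-length input are assumptions.

```python
from typing import List, Tuple


def pad_sequences(seqs: List[List[int]], max_length: int,
                  pad_id: int = 0) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad token lists to the longest sequence and build a matching attention mask."""
    longest = max(len(s) for s in seqs)
    if longest > max_length:
        raise ValueError(f"sequence of length {longest} exceeds max_length={max_length}")
    input_ids = [s + [pad_id] * (longest - len(s)) for s in seqs]
    attention_mask = [[1] * len(s) + [0] * (longest - len(s)) for s in seqs]
    return input_ids, attention_mask


input_ids, attention_mask = pad_sequences([[5, 6, 7], [8]], max_length=4)
print(input_ids)       # [[5, 6, 7], [8, 0, 0]]
print(attention_mask)  # [[1, 1, 1], [1, 0, 0]]

# Padding occupies memory without carrying signal: 2 of the 6 slots here are padding.
padded_slots = sum(row.count(0) for row in attention_mask)
print(padded_slots)  # 2
```

Batches with very uneven lengths waste a large share of their memory on padding, which is one reason token-based micro-batch budgets are easier to reason about than sequence counts.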

Parallelism Constraint Validation

When scaling parallelism to save memory, the following constraints must be respected:

Parallelism Type	Requirement	Source
Ulysses (CP)	`n_heads % cp_size == 0`	docs/en/best_practices/handling_oom.md126-132
Tensor (TP)	`n_heads % tp_size == 0`	docs/en/best_practices/handling_oom.md126-132
Expert (EP)	`num_experts % (strided_shard_degree * shard_degree) == 0`	areal/experimental/models/archon/moe_weight_converter.py105-115

Sources: docs/en/best_practices/handling_oom.md120-132 areal/experimental/models/archon/moe_weight_converter.py103-120
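
The divisibility rules in the table can be checked before launching a job. The helper below is a hypothetical pre-flight check written for this page, not a function from AReaL; it simply encodes the three conditions listed above.

```python
from typing import List, Optional


def validate_parallelism(n_heads: int, cp_size: int = 1, tp_size: int = 1,
                         num_experts: Optional[int] = None,
                         strided_shard_degree: int = 1,
                         shard_degree: int = 1) -> List[str]:
    """Return a list of violated constraints; an empty list means the layout is valid."""
    errors: List[str] = []
    if n_heads % cp_size != 0:
        errors.append(f"n_heads={n_heads} is not divisible by cp_size={cp_size}")
    if n_heads % tp_size != 0:
        errors.append(f"n_heads={n_heads} is not divisible by tp_size={tp_size}")
    if num_experts is not None:
        expert_shards = strided_shard_degree * shard_degree
        if num_experts % expert_shards != 0:
            errors.append(
                f"num_experts={num_experts} is not divisible by "
                f"strided_shard_degree * shard_degree = {expert_shards}"
            )
    return errors


print(validate_parallelism(32, cp_size=2, tp_size=4))  # []
print(validate_parallelism(28, cp_size=8, tp_size=4))  # one error: 28 % 8 != 0
```

Returning every violation at once, rather than raising on the first, makes it easier to fix a configuration in a single pass.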

Checkpoint and Recovery Memory

During checkpointing, memory pressure can spike. AReaL provides AsyncCheckpointManager (primarily for ArchonEngine) to stage checkpoints to CPU/disk asynchronously, minimizing training pauses and memory spikes. RecoverHandler manages the restoration of state, including dataloader_info, which is all-gathered across ranks to ensure continuity after an OOM-induced crash.

Sources: areal/utils/saver.py17-34 areal/utils/recover.py41-94

Page Sources: areal/api/cli_args.py areal/engine/fsdp_utils/__init__.py areal/engine/fsdp_utils/optimizer.py areal/experimental/models/archon/moe_weight_converter.py areal/utils/data.py areal/utils/saver.py areal/utils/recover.py docs/en/best_practices/handling_oom.md docs/zh/best_practices/handling_oom.md tests/test_per_layer_optim_step.py

URL: https://deepwiki.com/inclusionAI/AReaL/16.2-memory-and-oom-issues