Last indexed: 7 May 2026 (2e12c1)

Memory Management

This page describes memory optimization techniques and memory management strategies in AReaL's training engines. It covers configuration parameters, optimization methods for training and inference, and memory-efficient weight synchronization mechanisms.

For configuration details, see 2.4 Training Engine Configurations. For parallelism strategies that affect memory, see 8. Parallelism and Distribution. For troubleshooting OOM errors, see 16.2 Memory and OOM Issues.

Memory Optimization Overview

AReaL provides multiple complementary memory optimization strategies across three phases of the training pipeline:

Phase	Optimization Techniques	Primary Configuration
Inference/Rollout	Memory fraction tuning, Parallelism adjustment, Concurrent rollout limiting	`sglang.mem_fraction_static`, `max_concurrent_rollouts`, `allocation_mode`
Training/Gradient	Gradient checkpointing, Micro-batch sizing, CPU offloading, Per-layer optimizer step	`gradient_checkpointing`, `mb_spec.max_tokens_per_mb`, `offload_params`, `per_layer_optim_step`
Weight Updates	Disk-based sync, Chunked NCCL transfers	`weight_update_mode`, `weight_chunked_mem_mb`
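
The table can also be read as a lookup from the phase in which an OOM occurs to the parameters worth adjusting first. The following is a minimal sketch of that lookup. The phase keys and the helper name knobs_for_phase are hypothetical and not part of AReaL; only the parameter names are copied from the table.

```python
from typing import Dict, List

# Parameter names are copied from the overview table. The phase keys and
# this helper are illustrative assumptions, not AReaL APIs.
MEMORY_KNOBS: Dict[str, List[str]] = {
    "inference": [
        "sglang.mem_fraction_static",
        "max_concurrent_rollouts",
        "allocation_mode",
    ],
    "training": [
        "gradient_checkpointing",
        "mb_spec.max_tokens_per_mb",
        "offload_params",
        "per_layer_optim_step",
    ],
    "weight_update": [
        "weight_update_mode",
        "weight_chunked_mem_mb",
    ],
}


def knobs_for_phase(phase: str) -> List[str]:
    """Return the configuration parameters the table lists for a phase."""
    if phase not in MEMORY_KNOBS:
        raise ValueError(
            f"unknown phase {phase!r}; expected one of {sorted(MEMORY_KNOBS)}"
        )
    return list(MEMORY_KNOBS[phase])
```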

Sources: docs/en/best_practices/handling_oom.md1-40 docs/en/best_practices/handling_oom.md200-226

Core Memory Parameters

Critical Memory Configuration

The following parameters have the most significant impact on peak memory usage:

Micro-batch Token Limit (actor.mb_spec.max_tokens_per_mb):

Controls the maximum number of tokens processed in a single forward/backward pass docs/en/best_practices/handling_oom.md22-24
Directly determines training activation memory.
Must be ≥ max_length + max_new_tokens (sequence length constraint) docs/en/best_practices/handling_oom.md24
Lower values reduce memory but may decrease throughput.
Defined in MicroBatchSpec areal/api/cli_args.py115-120
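
To illustrate how the token limit trades memory for throughput, here is a minimal sketch that checks the sequence length constraint stated above and then packs sequences into micro-batches in order. The function names and the simple sequential packing rule are assumptions made for illustration; AReaL's actual packing logic may differ.

```python
from typing import List


def validate_mb_limit(max_tokens_per_mb: int, max_length: int,
                      max_new_tokens: int) -> None:
    """Enforce the constraint max_tokens_per_mb >= max_length + max_new_tokens."""
    required = max_length + max_new_tokens
    if max_tokens_per_mb < required:
        raise ValueError(
            f"max_tokens_per_mb={max_tokens_per_mb} is below the required {required}"
        )


def pack_micro_batches(seq_lens: List[int],
                       max_tokens_per_mb: int) -> List[List[int]]:
    """Group sequence indices in order so each group stays within the limit.

    Hypothetical packing rule: start a new micro-batch whenever adding the
    next sequence would exceed the token limit.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    current_tokens = 0
    for idx, n_tokens in enumerate(seq_lens):
        if n_tokens > max_tokens_per_mb:
            raise ValueError(f"sequence {idx} alone exceeds the token limit")
        if current and current_tokens + n_tokens > max_tokens_per_mb:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(idx)
        current_tokens += n_tokens
    if current:
        batches.append(current)
    return batches
```

With sequence lengths [3000, 2000, 4000, 1000], a limit of 5000 tokens yields two micro-batches, while a limit of 4000 yields four. Each pass then holds fewer activations, but twice as many forward/backward passes are needed.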

Concurrent Rollout Limit (max_concurrent_rollouts):

Number of parallel generation requests sent to inference engines docs/en/best_practices/handling_oom.md26-27
Most effective parameter for controlling inference memory pressure docs/en/best_practices/handling_oom.md45-54
Higher values improve throughput but increase KV cache memory.
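
The link between concurrency and KV cache demand can be made concrete with the standard per-token KV cache size formula. The sketch below is a rough upper bound that assumes every rollout reaches its full length at the same time. The model dimensions in the usage note are hypothetical, and the function names are not AReaL or SGLang APIs.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers.

    The leading factor 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def rollout_kv_cache_gib(max_concurrent_rollouts: int, tokens_per_rollout: int,
                         num_layers: int, num_kv_heads: int, head_dim: int,
                         dtype_bytes: int = 2) -> float:
    """Worst-case KV cache demand in GiB, summed over all inference GPUs."""
    per_token = kv_cache_bytes_per_token(
        num_layers, num_kv_heads, head_dim, dtype_bytes
    )
    total_bytes = max_concurrent_rollouts * tokens_per_rollout * per_token
    return total_bytes / 2**30
```

For a hypothetical model with 32 layers, 8 KV heads, and a head dimension of 128 in 16-bit precision, each token occupies 128 KiB. With 64 concurrent rollouts of 8192 tokens each, the demand is 64 GiB. Halving max_concurrent_rollouts halves this figure.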

Allocation Mode (allocation_mode):

Defines parallelism strategy across GPUs (e.g., sglang:d2t2+fsdp:d2t2) docs/en/best_practices/handling_oom.md12-14
Tensor parallelism (TP) typically uses less memory per GPU than data parallelism (DP) by sharding model weights docs/en/best_practices/handling_oom.md13-14
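
As a hedged illustration of how such a string maps to GPU counts and per-GPU weight memory, the sketch below parses the example above. It assumes that d and t denote data-parallel and tensor-parallel degrees, which matches the DP/TP wording here but is not confirmed by the cited source. This is not AReaL's parser, and the real grammar may support more than this sketch handles.

```python
import re
from typing import Dict


def parse_allocation_mode(mode: str) -> Dict[str, Dict[str, int]]:
    """Split 'backend:axes+backend:axes' into per-backend parallel degrees."""
    result: Dict[str, Dict[str, int]] = {}
    for part in mode.split("+"):
        backend, sep, spec = part.partition(":")
        if not sep:
            raise ValueError(f"expected 'backend:spec', got {part!r}")
        # Each axis is one letter followed by its degree, e.g. d2 or t2.
        result[backend] = {
            axis: int(degree) for axis, degree in re.findall(r"([a-z])(\d+)", spec)
        }
    return result


def gpus_required(degrees: Dict[str, int]) -> int:
    """GPUs used by one component: the product of its parallel degrees."""
    total = 1
    for degree in degrees.values():
        total *= degree
    return total


def inference_weights_per_gpu_gib(model_gib: float,
                                  degrees: Dict[str, int]) -> float:
    """Per-GPU weight footprint for an inference component.

    Assumption: only tensor parallelism shards inference weights, while
    data parallelism replicates them.
    """
    return model_gib / degrees.get("t", 1)
```

Under these assumptions, a 16 GiB model served with d4t1 keeps 16 GiB of weights on each GPU, while d1t4 keeps 4 GiB on each, using the same four GPUs.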

Inference Memory Fraction (sglang.mem_fraction_static):

Fraction of GPU memory allocated for SGLang KV cache docs/en/best_practices/handling_oom.md31-32
Default: 0.9. Reduce to 0.8 or lower if OOM occurs docs/en/best_practices/handling_oom.md76-78

Sources: docs/en/best_practices/handling_oom.md8-40 docs/en/best_practices/handling_oom.md45-80 areal/api/cli_args.py99-126

Memory Parameter Interaction

"Natural Language Config" to "Code Entity" mapping:

[Diagram: Configuration to Code Entity Mapping]

Sources: docs/en/best_practices/handling_oom.md8-40 areal/api/cli_args.py99-126 areal/engine/fsdp_engine.py203

Training Memory Optimizations

Gradient Checkpointing

Gradient checkpointing reduces activation memory by recomputing activations during the backward pass instead of storing them docs/en/best_practices/handling_oom.md105-106

Configuration: enabled through the gradient_checkpointing option listed in the overview table.

Implementation:

FSDPEngine: Gradient checkpointing is enabled via the training configuration and handled by the underlying PyTorch FSDP2 implementation areal/engine/fsdp_engine.py203-210
ArchonEngine: Supports activation checkpointing via ActivationCheckpointConfig areal/experimental/engine/archon_engine.py76-78
MegatronEngine: Inherits Megatron-Core's recompute features through the TransformerConfig areal/engine/megatron_engine.py29-31
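
The recompute-instead-of-store idea can be shown without any framework. The sketch below is a toy model of the technique, not the FSDP, Archon, or Megatron implementation: it backpropagates through a chain of scalar tanh layers, once keeping every layer input and once keeping only one checkpoint per segment and recomputing the rest during the backward pass. All names are hypothetical.

```python
import math
from typing import List, Tuple


def forward(w: float, x: float) -> float:
    return math.tanh(w * x)


def backward(w: float, x: float, grad_out: float) -> float:
    # d/dx tanh(w * x) = w * (1 - tanh(w * x)^2)
    y = math.tanh(w * x)
    return grad_out * w * (1.0 - y * y)


def grad_store_all(weights: List[float], x0: float) -> Tuple[float, int]:
    """Backprop while keeping every layer input; returns (grad, inputs kept)."""
    inputs = [x0]
    for w in weights[:-1]:
        inputs.append(forward(w, inputs[-1]))
    grad = 1.0
    for w, x in zip(reversed(weights), reversed(inputs)):
        grad = backward(w, x, grad)
    return grad, len(inputs)


def grad_checkpointed(weights: List[float], x0: float,
                      segment: int) -> Tuple[float, int]:
    """Backprop keeping one input per segment; returns (grad, peak inputs kept)."""
    checkpoints = []
    x = x0
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints.append(x)  # keep only the segment's first input
        x = forward(w, x)
    grad = 1.0
    peak = len(checkpoints)
    for seg in reversed(range(len(checkpoints))):
        seg_weights = weights[seg * segment:(seg + 1) * segment]
        # Recompute the inputs inside this segment from its checkpoint.
        inputs = [checkpoints[seg]]
        for w in seg_weights[:-1]:
            inputs.append(forward(w, inputs[-1]))
        peak = max(peak, len(checkpoints) + len(inputs) - 1)
        for w, x in zip(reversed(seg_weights), reversed(inputs)):
            grad = backward(w, x, grad)
    return grad, peak
```

For 16 layers with a segment length of 4, both functions return the same gradient, but the checkpointed version holds at most 7 layer inputs at once instead of 16, at the cost of one extra forward pass per segment.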

Sources: docs/en/best_practices/handling_oom.md105-110 areal/engine/fsdp_engine.py203-210 areal/experimental/engine/archon_engine.py76-78 areal/engine/megatron_engine.py29-31

CPU Offloading

CPU offloading moves parameters and optimizer states to CPU memory, streaming them to GPU only when needed.

Configuration: controlled by the offload_params option listed in the overview table.

Memory Flow:

[Diagram: Parameter and State Memory Flow]
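
As a back-of-the-envelope sketch of how much GPU memory offloading can release, the function below estimates the per-GPU size of sharded parameters plus optimizer states. The byte counts are assumptions (4-byte sharded parameters and two 4-byte Adam states per parameter), and the function name is hypothetical. Actual savings depend on the engine's precision settings.

```python
def offloadable_gib_per_gpu(num_params: int, world_size: int,
                            param_bytes: int = 4,
                            optim_state_bytes: int = 8) -> float:
    """Per-GPU GiB of sharded parameters and optimizer states.

    Assumptions: parameters are evenly sharded across world_size ranks,
    and Adam keeps two states (momentum and variance) per parameter.
    """
    shard_params = num_params / world_size
    return shard_params * (param_bytes + optim_state_bytes) / 2**30
```

For example, a model with 8 * 2**30 parameters sharded over 8 GPUs holds about 12 GiB of such state per GPU under these assumptions. That is the amount that moves to host memory when both parameters and optimizer states are offloaded.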

Sources: docs/en/best_practices/handling_oom.md173-181 areal/engine/fsdp_engine.py203 areal/engine/fsdp_utils/optimizer.py320

Per-Layer Optimizer Step

When optimizer states are offloaded to CPU, the default CPU-based optimizer step is slow. PerLayerOptimWrapper areal/engine/fsdp_utils/optimizer.py320 streams optimizer states per-layer to GPU for updates.

Configuration: enabled by the per_layer_optim_step option listed in the overview table.

Key Implementation Details:

Layer Grouping areal/engine/fsdp_utils/optimizer.py383-414:
- Identifies all FSDPModule submodules (transformer layers) using fully_shard_module.FSDPModule areal/engine/fsdp_utils/optimizer.py383
- Groups parameters by module to enable per-layer streaming.
Pipelined Execution areal/engine/fsdp_utils/optimizer.py581-631:
- Uses independent CUDA streams: h2d_stream, compute_stream, d2h_stream.
- Prefetches prefetch_layers ahead while the current layer computes areal/engine/fsdp_utils/optimizer.py611-615
Optimizer Kernel areal/engine/fsdp_utils/optimizer.py214-318:
- OptimKernel abstraction supports different optimizers.
- AdamKernel uses PyTorch's internal _adam_fn areal/engine/fsdp_utils/optimizer.py14 for performance.
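
The streaming pattern described above can be sketched as a sequential simulation. This is not PerLayerOptimWrapper: it uses plain Python lists instead of tensors, a dictionary stands in for GPU residency, and the three CUDA streams are collapsed into ordered steps. The function names and the layer dictionary layout are hypothetical. It shows only that per-layer streaming bounds the number of resident layers while producing the same Adam update.

```python
import math
from typing import Dict, List, Tuple


def adam_update(param: List[float], grad: List[float], m: List[float],
                v: List[float], step: int, lr: float = 1e-3, b1: float = 0.9,
                b2: float = 0.999, eps: float = 1e-8
                ) -> Tuple[List[float], List[float], List[float]]:
    """One plain Adam step on Python lists; returns new (param, m, v)."""
    new_p, new_m, new_v = [], [], []
    for p, g, mi, vi in zip(param, grad, m, v):
        mi = b1 * mi + (1 - b1) * g
        vi = b2 * vi + (1 - b2) * g * g
        m_hat = mi / (1 - b1 ** step)
        v_hat = vi / (1 - b2 ** step)
        new_p.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_p, new_m, new_v


def per_layer_step(cpu_layers: List[Dict[str, List[float]]], step: int,
                   prefetch_layers: int = 1) -> int:
    """Update layers one at a time; returns the peak number of staged layers."""
    staged: Dict[int, Dict[str, List[float]]] = {}  # stands in for GPU memory
    next_to_fetch = 0
    peak = 0
    n = len(cpu_layers)
    for i in range(n):
        # "H2D": stage the current layer plus up to prefetch_layers ahead.
        while next_to_fetch < n and next_to_fetch <= i + prefetch_layers:
            src = cpu_layers[next_to_fetch]
            staged[next_to_fetch] = {k: list(val) for k, val in src.items()}
            next_to_fetch += 1
        peak = max(peak, len(staged))
        # "Compute": run the optimizer step on the staged copy.
        s = staged[i]
        p, m, v = adam_update(s["param"], s["grad"], s["m"], s["v"], step)
        # "D2H": write results back to the CPU copy and release the layer.
        cpu_layers[i].update(param=p, m=m, v=v)
        del staged[i]
    return peak
```

With six layers and prefetch_layers=2, at most three layers are staged at any time, regardless of model depth.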

Sources: docs/en/best_practices/handling_oom.md167-181 areal/engine/fsdp_utils/optimizer.py320-632

Memory-Efficient Model Loading

Standard model loading materializes the full weights on every GPU before sharding. Memory-efficient loading avoids this: only rank 0 loads the pretrained weights and broadcasts them to the other ranks.

Configuration:

Implementation: The function fsdp2_load_full_state_dict() in areal/engine/fsdp_utils/__init__.py implements this:

Uses StateDictOptions(broadcast_from_rank0=True) areal/engine/fsdp_utils/__init__.py141-146
Rank 0 loads pretrained weights and broadcasts to all other ranks docs/en/best_practices/handling_oom.md219-220
For Vision-Language Models (VLMs), weights are loaded on CPU rather than GPU per rank to reduce peak GPU memory during initialization docs/en/best_practices/handling_oom.md197-199

Sources: docs/en/best_practices/handling_oom.md203-223 areal/engine/fsdp_utils/__init__.py110-161

Precision Reduction for Optimizer States

AnyPrecisionAdamW areal/engine/fsdp_utils/optimizer.py44 allows configuring the precision of optimizer momentum and variance states.

Configuration:

Implementation:

Supports configurable dtypes for momentum (momentum_dtype) and variance (variance_dtype) areal/engine/fsdp_utils/optimizer.py53-54
Optional Kahan summation areal/engine/fsdp_utils/optimizer.py52 for high-precision updates with low-precision states.
States are initialized with the target precision areal/engine/fsdp_utils/optimizer.py144-153
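
To show why Kahan summation matters with low-precision storage, the sketch below emulates bfloat16 rounding in pure Python and applies many small updates to a value, with and without a compensation buffer. It is a generic Kahan-compensated update written for illustration. The order of operations inside AnyPrecisionAdamW is not shown in the cited lines, so this should not be read as its implementation. All names are hypothetical.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a Python float to the nearest bfloat16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 keeps the top 16 bits of a float32; round the dropped half.
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def apply_updates(param: float, update: float, steps: int,
                  use_kahan: bool) -> float:
    """Add `update` to a bfloat16 value `steps` times."""
    p = to_bf16(param)
    comp = 0.0  # compensation buffer, also kept in bfloat16
    for _ in range(steps):
        if use_kahan:
            comp = to_bf16(comp + update)          # accumulate pending update
            previous = p
            p = to_bf16(p + comp)                  # apply what bf16 can hold
            comp = to_bf16(comp + (previous - p))  # keep what was rounded away
        else:
            p = to_bf16(p + update)                # small updates are lost
    return p
```

Starting from 1.0 and adding 0.001 a thousand times, the uncompensated version never leaves 1.0, because each update is smaller than half the bfloat16 spacing near 1.0. The compensated version ends close to 2.0.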

Sources: docs/en/best_practices/handling_oom.md189-201 areal/engine/fsdp_utils/optimizer.py44-189

Weight Synchronization Memory

Weight updates between training and inference engines can cause memory spikes during the transfer phase.

Disk-Based Weight Updates

Configuration: selected through the weight_update_mode option listed in the overview table.

Mechanism:

Training engine saves weights to a shared directory (cluster.fileroot) docs/en/best_practices/handling_oom.md210-213
Inference engine loads weights from disk. This eliminates NCCL transfer buffers in GPU memory.
ArchonEngine implements weight updates via update_weights_from_disk areal/experimental/engine/archon_weight_sync.py65
FSDPEngine handles versioned weight updates via WeightUpdateMeta areal/engine/fsdp_engine.py55

NCCL-Based Weight Updates with Chunking

If using NCCL (default), memory can be managed by reducing the transfer buffer size:

weight_chunked_mem_mb: Controls the size of the memory buffer used for weight chunking during transfer docs/en/best_practices/handling_oom.md223-224
ArchonEngine uses update_weights_from_distributed for this areal/experimental/engine/archon_weight_sync.py66
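
The effect of the buffer size can be sketched as a bucketing problem: tensors are grouped so that each transfer stays within the budget. The grouping rule and function name below are assumptions made for illustration, not the logic in update_weights_from_distributed.

```python
from typing import Iterable, List, Tuple


def chunk_by_budget(tensors: Iterable[Tuple[str, int]],
                    weight_chunked_mem_mb: int) -> List[List[str]]:
    """Group (name, nbytes) pairs in order into chunks within the budget.

    A tensor larger than the budget is placed in a chunk of its own.
    """
    budget = weight_chunked_mem_mb * 1024 * 1024
    chunks: List[List[str]] = []
    current: List[str] = []
    used = 0
    for name, nbytes in tensors:
        if current and used + nbytes > budget:
            chunks.append(current)
            current, used = [], 0
        current.append(name)
        used += nbytes
    if current:
        chunks.append(current)
    return chunks
```

A smaller budget produces more, smaller transfers: the peak buffer size drops, while the number of collective calls rises.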

Sources: docs/en/best_practices/handling_oom.md200-226 areal/experimental/engine/archon_weight_sync.py62-67

Decision Tree for Memory Optimization

[Diagram: OOM Resolution Decision Logic]

Sources: docs/en/best_practices/handling_oom.md1-226

URL: https://deepwiki.com/inclusionAI/AReaL/3.8-memory-management