Last indexed: 7 May 2026 (2e12c1)

Troubleshooting and Best Practices

This page provides a high-level overview of common issues encountered when running AReaL, strategies for debugging distributed RL training, and best practices for optimizing performance. For detailed guides on specific categories, please refer to the child pages linked in each section.

System Architecture Overview

Understanding the interaction between the Trainer, Inference Engines, and the Scheduler is critical for troubleshooting. The following diagram maps these concepts to the code entities responsible for system execution.

System Entity Mapping

The diagram below maps high-level system components to their primary implementation classes and files.

Sources: areal/api/base.py16-23 areal/utils/recover.py151-166

Common Configuration Errors

Configuration in AReaL is handled through a hierarchical dataclass system. Errors typically arise from mismatched allocation_mode strings or invalid micro-batch specifications.

Allocation Mismatches: Ensure that the allocation_mode string matches the number of nodes and GPUs that are actually available. A mismatch between the requested world size and the available resources causes hangs during process group initialization.
Checkpoint Paths: An incorrect fileroot or experiment_name in SaverConfig can lead to failures in Saver.get_save_root (areal/utils/saver.py35-45).
Recover Failures: If RecoverConfig points to an invalid directory, the system raises InValidRecoverInfo (areal/utils/recover.py36-37). The RecoverHandler expects a specific structure containing step_info.json, saver_info.json, and dataloader_info.pkl (areal/utils/recover.py70-93).
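Before restarting a run, it can help to confirm that the recover directory contains the three files named above. The following is a minimal sketch of such a pre-flight check. The helper missing_recover_files is hypothetical and not part of AReaL; it only tests for file presence and does not validate file contents or reproduce the checks that RecoverHandler performs.

```python
from pathlib import Path

# File names taken from the text above; the helper itself is hypothetical.
EXPECTED_RECOVER_FILES = ("step_info.json", "saver_info.json", "dataloader_info.pkl")


def missing_recover_files(recover_dir: str) -> list[str]:
    """Return the expected recover files that are absent from recover_dir."""
    root = Path(recover_dir)
    return [name for name in EXPECTED_RECOVER_FILES if not (root / name).is_file()]
```

An empty return value means all three files are present. A non-empty list names the files to look for before suspecting other causes of a recover failure.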

For a comprehensive list of configuration pitfalls, see Common Configuration Errors.

Sources: areal/utils/saver.py24-34 areal/utils/recover.py96-149 areal/utils/recover.py70-93

Memory and OOM Issues

Out-of-Memory (OOM) errors are the most common failure mode in distributed LLM training. AReaL provides several mechanisms to mitigate this, managed primarily within the TrainEngine and configuration layers.

Memory Optimization Flow

The following diagram illustrates how memory-saving features relate to the training backend.

Gradient Checkpointing: Essential for long-context training because it reduces activation memory (docs/en/best_practices/handling_oom.md105-110).
Per-layer Optimizer: The PerLayerOptimWrapper reduces peak memory by streaming optimizer states and updating weights layer by layer (areal/engine/fsdp_utils/optimizer.py19-40). It supports AdamKernel and OptimKernel for efficient updates (areal/engine/fsdp_utils/__init__.py32-39). It can be combined with FSDP CPU offloading to speed up updates while keeping memory usage low (docs/en/best_practices/handling_oom.md167-181). Correctness is verified by comparing snapshots against baseline optimizers (tests/test_per_layer_optim_step.py90-103).
Micro-batching: Controlled by MicroBatchSpec, which lets users fit large global batches into limited VRAM by adjusting max_tokens_per_mb (docs/en/best_practices/handling_oom.md22-24).
Memory Efficient Loading: Enabling memory_efficient_load in FSDP loads large models by broadcasting weights from rank 0 instead of loading the full weights on every rank (docs/en/best_practices/handling_oom.md203-213).
Lightweight Optimizers: Switching from AdamW to SGD or AdamW_bf16 can significantly reduce memory overhead (docs/en/best_practices/handling_oom.md189-201). AnyPrecisionAdamW allows bfloat16 to be used for the momentum and variance states, which reduces optimizer-state memory (areal/engine/fsdp_utils/optimizer.py44-56).
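To see why the choice of optimizer matters, the arithmetic behind optimizer-state memory can be sketched as follows. This is a back-of-the-envelope estimate, not a measurement of AReaL: it assumes Adam-style optimizers keep two state tensors per parameter (first and second moment), that plain SGD keeps none, and it ignores parameters, gradients, activations, and sharding across ranks. The function names are illustrative.

```python
# Bytes per element for the dtypes discussed above.
BYTES_PER_ELEMENT = {"float32": 4, "bfloat16": 2}


def adam_state_bytes(num_params: int, state_dtype: str = "float32") -> int:
    """Adam-style state memory: two tensors per parameter, each in state_dtype."""
    return 2 * num_params * BYTES_PER_ELEMENT[state_dtype]


def sgd_state_bytes(num_params: int, momentum: bool = False,
                    state_dtype: str = "float32") -> int:
    """Plain SGD keeps no state; SGD with momentum keeps one buffer per parameter."""
    return num_params * BYTES_PER_ELEMENT[state_dtype] if momentum else 0


if __name__ == "__main__":
    n = 7_000_000_000  # illustrative parameter count, not a specific model
    gib = 1024 ** 3
    print(f"Adam, float32 states:  {adam_state_bytes(n, 'float32') / gib:.1f} GiB")
    print(f"Adam, bfloat16 states: {adam_state_bytes(n, 'bfloat16') / gib:.1f} GiB")
    print(f"SGD, no momentum:      {sgd_state_bytes(n) / gib:.1f} GiB")
```

Under this model, storing the states in bfloat16 halves optimizer-state memory relative to float32. With FSDP the totals are further divided across ranks, so the per-GPU figure depends on the sharding degree.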

For details on diagnosing and resolving OOMs, see Memory and OOM Issues.
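The idea behind a token budget such as max_tokens_per_mb can be illustrated with a small packing routine. The sketch below uses a first-fit-decreasing heuristic to group sequences so that no micro-batch exceeds the budget. It is an illustration only: the function name is hypothetical, AReaL's actual splitting logic may differ, and raising an error for an oversized sequence is an assumption of this sketch.

```python
def split_into_micro_batches(seq_lens: list[int], max_tokens_per_mb: int) -> list[list[int]]:
    """Group sequence indices into micro-batches whose total token count
    stays within max_tokens_per_mb (first-fit-decreasing heuristic)."""
    if any(n > max_tokens_per_mb for n in seq_lens):
        raise ValueError("a single sequence exceeds max_tokens_per_mb")
    batches: list[list[int]] = []  # sequence indices per micro-batch
    loads: list[int] = []          # running token count per micro-batch
    # Visit the longest sequences first so large items are placed early.
    order = sorted(range(len(seq_lens)), key=lambda i: seq_lens[i], reverse=True)
    for idx in order:
        for b, load in enumerate(loads):
            if load + seq_lens[idx] <= max_tokens_per_mb:
                batches[b].append(idx)
                loads[b] += seq_lens[idx]
                break
        else:
            # No existing micro-batch has room, so open a new one.
            batches.append([idx])
            loads.append(seq_lens[idx])
    return batches
```

A smaller budget produces more micro-batches with lower peak memory each, at the cost of more forward and backward passes per step.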

Sources: areal/engine/fsdp_utils/optimizer.py1-101 docs/en/best_practices/handling_oom.md6-213 areal/engine/fsdp_utils/__init__.py32-39 tests/test_per_layer_optim_step.py146-197

Performance Optimization

Maximizing throughput in AReaL requires balancing the asynchronous rollout generation with the training steps.

Asynchronous Overlap: AReaL's core innovation is overlapping rollout (inference) with training. An InferenceEngine that sits idle for long periods indicates a bottleneck elsewhere in the pipeline.
Weight Synchronization: The weight_update_mode (e.g., disk vs. nccl) significantly impacts step latency (docs/zh/best_practices/handling_oom.md204-211).
Tracing: Use the PerfTracer for method-level execution timing and the SessionTracer for high-level rollout/train lifecycle tracking (areal/utils/perf_tracer.py62-94). The system uses PerfTraceCategory (e.g., COMPUTE, COMM, IO) to organize these traces (areal/utils/perf_tracer.py62-85).
Parallelism Tuning: Using higher-dimensional parallelism (e.g., Pipeline Parallelism p or Expert Parallelism e in Archon) can improve throughput for large models (docs/en/best_practices/handling_oom.md144-154). MoE models use MoEWeightConverter to handle 3D parameter sharding efficiently across Expert Parallelism (EP) and Tensor Parallelism (TP) (areal/experimental/models/archon/moe_weight_converter.py46-61).
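One way to quantify the asynchronous overlap described above is to compare the time intervals during which rollout and training were active. The sketch below computes the fraction of training time that coincides with rollout activity. The interval representation (start, end) in seconds and both function names are assumptions made for illustration; they are not the output format of PerfTracer or SessionTracer.

```python
Interval = tuple[float, float]  # (start, end) in seconds


def overlap_seconds(a: list[Interval], b: list[Interval]) -> float:
    """Total time during which an interval from a and one from b are both active.

    Assumes the intervals within each list do not overlap one another.
    """
    total = 0.0
    for a_start, a_end in a:
        for b_start, b_end in b:
            total += max(0.0, min(a_end, b_end) - max(a_start, b_start))
    return total


def overlap_ratio(rollout: list[Interval], train: list[Interval]) -> float:
    """Fraction of training time that is covered by rollout activity."""
    train_time = sum(end - start for start, end in train)
    return overlap_seconds(rollout, train) / train_time if train_time else 0.0
```

A ratio near 1.0 suggests that rollout runs concurrently with most of the training time, while a low ratio suggests the two phases are effectively serialized.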

Performance Monitoring Tools

Tool	Purpose	File Path
`PerfTracer`	Method-level execution timing	areal/utils/perf_tracer.py28-94
`perf_trace_converter`	Convert JSONL traces to Chrome/Perfetto format	areal/tools/perf_trace_converter.py121-215
`plot_session_trace`	Visualization of async overlap and phase durations	areal/tools/plot_session_trace.py20-42
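To make the conversion step concrete, the sketch below turns JSONL records into the Chrome Trace Event Format, which the Chrome tracing viewer and Perfetto can load. The input schema (name, category, rank, start_s, end_s) is a hypothetical one chosen for illustration and is not the schema written by PerfTracer; the function is likewise not the implementation in perf_trace_converter.py. Only the output layout (a traceEvents list of complete events with microsecond timestamps) follows the published trace format.

```python
import json


def jsonl_to_chrome_trace(jsonl_text: str) -> dict:
    """Convert JSONL records into a Chrome Trace Event Format object.

    Assumed (hypothetical) input schema, one JSON object per line:
    {"name": str, "category": str, "rank": int, "start_s": float, "end_s": float}
    """
    events = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        rec = json.loads(line)
        events.append({
            "name": rec["name"],
            "cat": rec["category"],
            "ph": "X",                     # complete event: start plus duration
            "ts": rec["start_s"] * 1e6,    # timestamps are in microseconds
            "dur": (rec["end_s"] - rec["start_s"]) * 1e6,
            "pid": rec["rank"],            # one process row per rank
            "tid": 0,
        })
    return {"traceEvents": events}
```

Writing the returned dictionary with json.dump produces a file that can be opened in a trace viewer, with each rank shown as its own process row.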

For best practices on maximizing throughput, see Performance Optimization Guide.

Sources: areal/utils/perf_tracer.py1-94 areal/tools/perf_trace_converter.py1-13 areal/tools/plot_session_trace.py1-45 docs/en/best_practices/handling_oom.md144-154 areal/experimental/models/archon/moe_weight_converter.py125-146

Debugging Distributed Components

When training hangs or crashes in a distributed environment:

Check Traces: Use perf_trace_converter.py to merge traces from multiple ranks. It extracts rank and role metadata, which helps identify the rank that is lagging or hung (areal/tools/perf_trace_converter.py142-160).
Verify Shared Storage: WeightUpdateMeta and Saver rely on consistent paths across the cluster. Use Saver.get_save_root to verify that the checkpointing directory is accessible from all nodes (areal/utils/saver.py36-45).
Recovery State: The RecoverInfo class manages the persistence of StepInfo, saver_info, and dataloader_info across restarts (areal/utils/recover.py40-51). Ensure that RecoverInfo is dumped correctly by rank 0, which avoids file contention (areal/utils/recover.py64-66).
Distributed Tests: Use specialized test suites (e.g., run_ep_tests.py for Expert Parallelism or test_per_layer_optim_step.py for optimizer streaming) to verify numerical correctness and weight synchronization across different parallel configurations (tests/experimental/archon/torchrun/run_ep_tests.py1-25, tests/test_per_layer_optim_step.py1-15).
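The rank-0 dump pattern mentioned under Recovery State can be sketched generically. The example below writes a JSON file from rank 0 only and uses a temporary file followed by a rename so that readers do not observe a partially written file. It is a sketch under stated assumptions: dump_on_rank0 is a hypothetical helper, not the RecoverInfo implementation; the rank is read from the RANK environment variable that launchers such as torchrun set; and the atomicity of the rename depends on the filesystem, so behavior on shared network storage may differ.

```python
import json
import os
import tempfile


def dump_on_rank0(state: dict, path: str) -> bool:
    """Write state as JSON from rank 0 only; other ranks return False.

    Reads the RANK environment variable and defaults to 0 when it is absent
    (for example, in single-process runs).
    """
    if int(os.environ.get("RANK", "0")) != 0:
        return False
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # Write to a temporary file in the same directory, then rename it into
    # place so a crash mid-write does not leave a truncated target file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return True
```

In a real distributed job, the other ranks would typically wait at a barrier after this call before reading the file.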

Sources: areal/utils/recover.py40-69 areal/tools/perf_trace_converter.py30-55 tests/experimental/archon/torchrun/run_ep_tests.py1-25 tests/test_per_layer_optim_step.py146-197


URL: https://deepwiki.com/inclusionAI/AReaL/16-troubleshooting-and-best-practices