
URL: https://deepwiki.com/inclusionAI/AReaL/16-troubleshooting-and-best-practices

Troubleshooting and Best Practices | inclusionAI/AReaL | DeepWiki


Last indexed: 7 May 2026 (2e12c1)

Troubleshooting and Best Practices

This page provides a high-level overview of common issues encountered when running AReaL, strategies for debugging distributed RL training, and best practices for optimizing performance. For detailed guides on specific categories, please refer to the child pages linked in each section.

System Architecture Overview

Understanding how the Trainer, Inference Engines, and Scheduler interact is critical for troubleshooting. The mapping in this section ties those concepts to the code entities responsible for system execution.

System Entity Mapping

The diagram below maps high-level system components to their primary implementation classes and files.


Sources: areal/api/base.py:16-23, areal/utils/recover.py:151-166

Common Configuration Errors

Configuration in AReaL is handled through a hierarchical dataclass system. Errors typically arise from mismatched allocation_mode strings or invalid micro-batch specifications.

  • Allocation Mismatches: Ensure that the allocation_mode string matches the number of nodes and GPUs actually available. A mismatch between the requested world size and the available resources causes hangs during process group initialization.
  • Checkpoint Paths: An incorrect fileroot or experiment_name in SaverConfig can lead to failures in Saver.get_save_root (areal/utils/saver.py:35-45).
  • Recover Failures: If RecoverConfig points to an invalid directory, the system raises InValidRecoverInfo (areal/utils/recover.py:36-37). The RecoverHandler expects a specific structure containing step_info.json, saver_info.json, and dataloader_info.pkl (areal/utils/recover.py:70-93).
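The recover-directory requirement in the last item can be checked before launching a job. The following is a minimal sketch, not AReaL's own validation: the three file names come from the text above, while the helper name `missing_recover_files` and the assumption that the files sit directly inside the directory are illustrative.

```python
from pathlib import Path

# File names taken from the text above. The flat layout (all three files
# directly inside the recover directory) is an assumption of this sketch.
EXPECTED_RECOVER_FILES = ("step_info.json", "saver_info.json", "dataloader_info.pkl")


def missing_recover_files(recover_dir):
    """Return the expected recover files that are absent from recover_dir."""
    root = Path(recover_dir)
    return [name for name in EXPECTED_RECOVER_FILES if not (root / name).is_file()]
```

An empty list means all three files are present. A non-empty list names what to restore or regenerate before attempting a recover run.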

For a comprehensive list of configuration pitfalls, see Common Configuration Errors.
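The allocation mismatch described above can be caught with a pre-flight arithmetic check. The sketch below is illustrative only: it does not parse AReaL's allocation_mode string, and the function name `check_world_size` and the dictionary layout of parallel degrees are hypothetical. It assumes each component (for example, inference and training) uses the product of its parallel degrees in GPUs.

```python
import math


def check_world_size(components, n_nodes, gpus_per_node):
    """Compare the GPUs requested by all components with the GPUs available.

    `components` is a list of dicts mapping a parallelism dimension to its
    degree. This layout is a hypothetical stand-in for a parsed allocation.
    """
    requested = sum(math.prod(degrees.values()) for degrees in components)
    available = n_nodes * gpus_per_node
    if requested != available:
        raise ValueError(
            f"requested world size {requested} != available GPUs {available}"
        )
    return requested
```

The strict equality is a design choice of this sketch. A launcher that tolerates idle GPUs could relax it to `requested <= available`.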

Sources: areal/utils/saver.py:24-34, areal/utils/recover.py:96-149, areal/utils/recover.py:70-93

Memory and OOM Issues

Out-of-Memory (OOM) errors are the most common failure mode in distributed LLM training. AReaL provides several mechanisms to mitigate this, managed primarily within the TrainEngine and configuration layers.
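The mechanisms are not enumerated on this page. One widely used mitigation, shown here as a generic sketch rather than AReaL's implementation, is to cap the number of tokens per micro-batch so that peak activation memory stays bounded. The function name `split_by_token_budget` and its parameters are hypothetical.

```python
def split_by_token_budget(seq_lens, max_tokens_per_mb):
    """Greedily group sequence indices so each group stays within the budget."""
    if max_tokens_per_mb <= 0:
        raise ValueError("max_tokens_per_mb must be positive")
    micro_batches, current, current_tokens = [], [], 0
    for idx, n_tokens in enumerate(seq_lens):
        if n_tokens > max_tokens_per_mb:
            # A single sequence over budget cannot be packed at all.
            raise ValueError(f"sequence {idx} has {n_tokens} tokens, over budget")
        if current and current_tokens + n_tokens > max_tokens_per_mb:
            # Adding this sequence would overflow: close the current group.
            micro_batches.append(current)
            current, current_tokens = [], 0
        current.append(idx)
        current_tokens += n_tokens
    if current:
        micro_batches.append(current)
    return micro_batches
```

Lowering the budget trades throughput for memory headroom: more micro-batches per step, each with a smaller activation footprint.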

Memory Optimization Flow

The following diagram illustrates how memory-saving features relate to the training backend.



For details on diagnosing and resolving OOMs, see Memory and OOM Issues.

Sources: areal/engine/fsdp_utils/optimizer.py:1-101, docs/en/best_practices/handling_oom.md:6-213, areal/engine/fsdp_utils/__init__.py:32-39, tests/test_per_layer_optim_step.py:146-197

Performance Optimization

Maximizing throughput in AReaL requires balancing the asynchronous rollout generation with the training steps.

Performance Monitoring Tools

Tool | Purpose | File Path
PerfTracer | Method-level execution timing | areal/utils/perf_tracer.py:28-94
perf_trace_converter | Convert JSONL traces to Chrome/Perfetto format | areal/tools/perf_trace_converter.py:121-215
plot_session_trace | Visualization of async overlap and phase durations | areal/tools/plot_session_trace.py:20-42

For best practices on maximizing throughput, see Performance Optimization Guide.
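To make the JSONL-to-Chrome conversion concrete, here is a minimal sketch of the general technique. It is not the code in perf_trace_converter.py. The input field names (`name`, `start_us`, `dur_us`, `rank`, `role`) are hypothetical. The output uses the Chrome Trace Event JSON object format with "complete" events (`"ph": "X"`), whose `ts` and `dur` fields are in microseconds.

```python
import json


def jsonl_to_chrome_trace(lines):
    """Convert JSONL timing records into a Chrome Trace Event object.

    Input field names are assumptions made for this sketch only.
    """
    events = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL stream
        rec = json.loads(line)
        events.append(
            {
                "name": rec["name"],
                "ph": "X",  # complete event: start time plus duration
                "ts": rec["start_us"],
                "dur": rec["dur_us"],
                "pid": rec.get("rank", 0),  # one process row per rank
                "tid": 0,
                "args": {"role": rec.get("role", "unknown")},
            }
        )
    return {"traceEvents": events}
```

Writing the returned dictionary with `json.dump` produces a file that Chrome's tracing viewer or Perfetto can load, with each rank shown as a separate process row.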

Sources: areal/utils/perf_tracer.py:1-94, areal/tools/perf_trace_converter.py:1-13, areal/tools/plot_session_trace.py:1-45, docs/en/best_practices/handling_oom.md:144-154, areal/experimental/models/archon/moe_weight_converter.py:125-146

Debugging Distributed Components

When training hangs or crashes in a distributed environment:

  1. Check Traces: Use perf_trace_converter.py to merge traces from multiple ranks. The converter extracts rank and role metadata, which lets you identify the rank that is lagging or hung (areal/tools/perf_trace_converter.py:142-160).
  2. Verify Shared Storage: WeightUpdateMeta and Saver rely on consistent paths across the cluster. Use Saver.get_save_root to verify that the checkpointing directory is accessible from all nodes (areal/utils/saver.py:36-45).
  3. Recovery State: The RecoverInfo class manages the persistence of StepInfo, saver_info, and dataloader_info across restarts (areal/utils/recover.py:40-51). Ensure RecoverInfo is dumped by rank 0 only, to avoid file contention (areal/utils/recover.py:64-66).
  4. Distributed Tests: Use specialized test suites (e.g., run_ep_tests.py for Expert Parallelism or test_per_layer_optim_step.py for optimizer streaming) to verify numerical correctness and weight synchronization across different parallel configurations (tests/experimental/archon/torchrun/run_ep_tests.py:1-25, tests/test_per_layer_optim_step.py:1-15).
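Step 1 can be illustrated with a small sketch that flags ranks whose last recorded event ended well before the most recent event on any rank. This is a generic heuristic, not the logic in perf_trace_converter.py. The event fields (`rank`, `ts`, `dur`) and the function name `find_stalled_ranks` are hypothetical, and the sketch assumes timestamps are comparable across ranks.

```python
def find_stalled_ranks(events, tolerance_us):
    """Return ranks whose last event ended more than tolerance_us before the newest."""
    last_end = {}
    for ev in events:
        end = ev["ts"] + ev["dur"]
        rank = ev["rank"]
        # Track the latest end time observed for each rank.
        last_end[rank] = max(last_end.get(rank, end), end)
    if not last_end:
        return []
    newest = max(last_end.values())
    return sorted(r for r, end in last_end.items() if newest - end > tolerance_us)
```

A rank that appears in the result stopped emitting events while others kept going, which makes it the first place to look for a hang, such as a collective that never completed.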

Sources: areal/utils/recover.py:40-69, areal/tools/perf_trace_converter.py:30-55, tests/experimental/archon/torchrun/run_ep_tests.py:1-25, tests/test_per_layer_optim_step.py:146-197
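The rank-0-only dump mentioned in step 3 follows a common pattern, sketched below under stated assumptions. This is not the RecoverInfo implementation: the function name `dump_on_rank0` and the JSON payload are hypothetical, and the rank is read from the `RANK` environment variable as set by torchrun. The write goes to a temporary file first and is then renamed, so readers never see a partially written file.

```python
import json
import os
import tempfile


def dump_on_rank0(state, path, rank=None):
    """Write `state` as JSON to `path` on rank 0 only; return True if written."""
    if rank is None:
        rank = int(os.environ.get("RANK", "0"))
    if rank != 0:
        return False  # other ranks skip the write to avoid file contention
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, path)  # atomic rename within one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return True
```

In a real job, the other ranks would typically wait at a barrier after this call so that no rank proceeds before the file exists.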