VOOZH about

URL: https://deepwiki.com/inclusionAI/AReaL/16.3-debugging-distributed-training

⇱ Debugging Distributed Training | inclusionAI/AReaL | DeepWiki


Loading...
Last indexed: 7 May 2026 (2e12c1)
Menu

Debugging Distributed Training

Debugging distributed RL training presents unique challenges, including asynchronous process hangs, NCCL/XCCL deadlocks, and complex communication patterns between trainers and inference servers. AReaL provides a suite of tools and integrated patterns to diagnose these issues across different schedulers and backends.

Distributed Communication and Hangs

In distributed environments, a "hang" often occurs when one process in a collective communication group (NCCL/XCCL) fails to reach a synchronization point, causing all other participants to wait indefinitely.

Diagnosing with py-spy

AReaL's infrastructure is designed to be compatible with py-spy, a sampling profiler for Python programs. When a distributed job hangs, py-spy dump can be used on individual worker processes to inspect the stack trace and identify if a process is stuck in a dist.all_reduce, dist.broadcast, or an RPC wait.

Performance Tracing and Bottlenecks

AReaL includes a PerfTracer utility to record high-resolution timestamps for critical operations like computation, communication, and I/O areal/utils/perf_tracer.py62-94

Sources: areal/utils/perf_tracer.py62-94 areal/tools/perf_trace_converter.py121-136 areal/tools/plot_session_trace.py24-29

Logging and Monitoring Infrastructure

AReaL implements a hierarchical, color-coded logging system to help distinguish between different distributed components in a merged log file areal/utils/logging.py18-21

Component-Based Coloring

The LoggerColoredFormatter assigns specific colors to different system components based on exact name matches or prefix patterns areal/utils/logging.py159-184

Component CategoryColorIncluded Entities
InfrastructureBlueLocalScheduler, RayScheduler, SlurmLauncher, Saver areal/utils/logging.py40-46 areal/utils/logging.py91-93
OrchestrationWhiteTrainController, RolloutController, SyncRPCServer, SGLangWrapper, VLLMWrapper areal/utils/logging.py53-67 areal/infra/launcher/vllm_server.py31
RL LogicPurpleRLVRWorkflow, RewardAPI, ArealOpenAI, AgentGateway areal/utils/logging.py48-51 areal/utils/logging.py105-117
Data/MetricsGreenStatsLogger, PerfTracer, Dataset, RLTrainer areal/utils/logging.py57-75
Compute BackendsCyanFSDPEngine, MegatronEngine, PPOActor, Platform areal/utils/logging.py131-134 areal/utils/logging.py77 areal/utils/logging.py95-99

Sources: areal/utils/logging.py38-184 areal/infra/launcher/vllm_server.py31

Stats Logging and Tracking

The StatsLogger and StatsTracker provide the infrastructure for distributed metric aggregation areal/utils/logging.py57-58

  • Rank Filtering: To prevent log flooding, typically only the global rank 0 process initializes external connections (like WandB) and commits data.
  • Real-time Streaming: The StreamingFileHandler flushes after each log message to ensure logs are visible in real-time during a hang areal/utils/logging.py151-156

Sources: areal/utils/logging.py57-156

Debugging Workflow and Scheduler Interactions

The interaction between the Scheduler and the Worker processes is managed via the RolloutController, which orchestrates worker lifecycle and task dispatching areal/infra/controller/rollout_controller.py72-81

Process Lifecycle and Engine Initialization

The RolloutController initializes workers by defining a Job with specific SchedulingSpec requirements (CPU, GPU, Memory) areal/infra/controller/rollout_controller.py181-199

Distributed Startup and Handshake Sequence


Sources: areal/infra/controller/rollout_controller.py152-204

Launcher-Specific Debugging

Each launcher implementation provides different levels of process control and monitoring:

Sources: areal/infra/launcher/local.py43-85 areal/infra/launcher/ray.py115-204 areal/infra/launcher/slurm.py40-83

Debugging Agent Workflows

Agentic RL workflows involving ArealOpenAI often use a ProxyRolloutServer to act as an OpenAI-compatible gateway areal/experimental/openai/proxy/proxy_rollout_server.py1-63

Persistent Inference Servers

When debugging agents, it is common to launch standalone inference servers to verify connectivity and generation quality. AReaL provides wrappers for this:

Data Flow: Natural Language to Code Entities

The following diagram maps the logical flow of an agent interaction through the proxy and inference engine.

Agent Interaction Data Flow


Sources: areal/experimental/openai/proxy/proxy_rollout_server.py40-96 areal/infra/remote_inf_engine.py67-142 areal/infra/launcher/sglang_server.py65-105 areal/infra/launcher/vllm_server.py108-124

Best Practices for Troubleshooting

  1. Check traces.jsonl: If performance is degraded, use PerfTracer to identify which category (COMM, COMPUTE, IO) is consuming the most time areal/utils/perf_tracer.py87-93
  2. Interactive Timeline: Use areal/tools/plot_session_trace.py to generate Plotly-based HTML visualizations of session lifecycles areal/tools/plot_session_trace.py16-18
  3. Process Group Warmup: On NPU/HCCL platforms, ensure warmup_process_groups is called to avoid lazy initialization races that lead to error code 7 areal/engine/core/distributed.py26-42
  4. Network Probing: Use gethostip and find_free_ports to diagnose binding issues in multi-node setups areal/utils/network.py12-25 areal/utils/network.py114-134
  5. Weight Sync Validation: Check RemoteInfEngine logs for successful weight updates via disk or NCCL areal/infra/remote_inf_engine.py177-214

Sources: areal/utils/perf_tracer.py87-93 areal/engine/core/distributed.py26-42 areal/utils/network.py12-134 areal/infra/remote_inf_engine.py177-214