Last indexed: 7 May 2026 (2e12c1)

Debugging Distributed Training

Debugging distributed RL training presents unique challenges, including asynchronous process hangs, NCCL/XCCL deadlocks, and complex communication patterns between trainers and inference servers. AReaL provides a suite of tools and integrated patterns to diagnose these issues across different schedulers and backends.

Distributed Communication and Hangs

In distributed environments, a "hang" often occurs when one process in a collective communication group (NCCL/XCCL) fails to reach a synchronization point, causing all other participants to wait indefinitely.
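The failure mode can be illustrated without any GPUs. The sketch below is only an analogy built on the standard library's threading.Barrier, not NCCL/XCCL or AReaL code; run_collective and its parameters are hypothetical names. It shows how a single participant that never reaches the synchronization point leaves every other participant blocked.

```python
import threading


def run_collective(num_ranks, missing_rank=None, timeout_s=0.5):
    """Simulate ranks meeting at a sync point and report what each one saw."""
    barrier = threading.Barrier(num_ranks)
    outcome = {}

    def worker(rank):
        if rank == missing_rank:
            # Stand-in for a rank that crashed or took a different code path.
            outcome[rank] = "never arrived"
            return
        try:
            # Without the timeout this wait would block forever, like a real hang.
            barrier.wait(timeout=timeout_s)
            outcome[rank] = "passed"
        except threading.BrokenBarrierError:
            outcome[rank] = "stuck at sync point"

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcome


if __name__ == "__main__":
    print(run_collective(3, missing_rank=1))
```

The timeout exists only so the demonstration terminates. A real collective typically has no such escape hatch by default, which is why the healthy ranks appear frozen rather than failing.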

Diagnosing with py-spy

AReaL's infrastructure is designed to be compatible with py-spy, a sampling profiler for Python programs. When a distributed job hangs, run py-spy dump against each worker process to inspect its stack trace and determine whether that process is stuck in dist.all_reduce, dist.broadcast, or an RPC wait. Comparing the dumps across ranks usually reveals the one rank that never entered the collective the others are waiting on.
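A minimal sketch of collecting dumps from several workers follows. It assumes py-spy is installed and on PATH, and that the worker PIDs are already known (for example from ps); the helper names py_spy_dump_cmd, dump_stacks, and stuck_in are hypothetical and not part of AReaL.

```python
import shutil
import subprocess


def py_spy_dump_cmd(pid, show_locals=False):
    """Build the `py-spy dump` command line for one process."""
    cmd = ["py-spy", "dump", "--pid", str(pid)]
    if show_locals:
        cmd.append("--locals")  # also print local variables for each frame
    return cmd


def dump_stacks(pids, timeout_s=30):
    """Run `py-spy dump` for each PID and return {pid: captured text}."""
    if shutil.which("py-spy") is None:
        raise RuntimeError("py-spy not found on PATH")
    dumps = {}
    for pid in pids:
        result = subprocess.run(
            py_spy_dump_cmd(pid), capture_output=True, text=True, timeout=timeout_s
        )
        # Keep stderr on failure so permission problems are visible per PID.
        dumps[pid] = result.stdout if result.returncode == 0 else result.stderr
    return dumps


def stuck_in(dump_text, markers=("all_reduce", "broadcast", "barrier")):
    """Return the markers that appear anywhere in a stack dump."""
    return [m for m in markers if m in dump_text]
```

Attaching to a process owned by another user, or running inside a container, may require elevated privileges; in that case the captured text for that PID will contain the error from py-spy rather than a stack.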

Performance Tracing and Bottlenecks

AReaL includes a PerfTracer utility to record high-resolution timestamps for critical operations like computation, communication, and I/O areal/utils/perf_tracer.py62-94

Categories: Events are classified into categories like COMPUTE, COMM, IO, SYNC, and SCHEDULER areal/utils/perf_tracer.py87-93
Trace Visualization: Use areal/tools/perf_trace_converter.py to convert the resulting traces.jsonl files into a Chrome Tracing compatible format that can be opened in chrome://tracing or Perfetto areal/tools/perf_trace_converter.py121-127. The converter remaps process and thread IDs so that identifiers stay unique across distributed ranks areal/tools/perf_trace_converter.py121-136
Session Tracing: For agent workflows, SessionTracer tracks the lifecycle of individual rollout episodes, including generation, reward computation, and tool-call phases areal/utils/perf_tracer.py118-121 areal/tools/plot_session_trace.py24-29

Sources: areal/utils/perf_tracer.py62-94 areal/tools/perf_trace_converter.py121-136 areal/tools/plot_session_trace.py24-29
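To make the ID remapping concrete, here is a minimal sketch of a JSONL to Chrome Trace Event conversion. The input schema (name, category, start_us, dur_us, rank, pid, tid) is an assumption made for illustration; the actual fields written by PerfTracer and the logic in perf_trace_converter.py may differ. The output uses the standard "complete event" form (ph set to "X", timestamps in microseconds).

```python
import json


def to_chrome_trace(jsonl_lines):
    """Convert assumed-schema JSONL records into a Chrome Trace Event document."""
    pid_map = {}  # (rank, os_pid) -> unique synthetic pid
    tid_map = {}  # (rank, os_pid, os_tid) -> unique synthetic tid
    events = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        pkey = (rec["rank"], rec["pid"])
        tkey = pkey + (rec["tid"],)
        # OS pids and tids can collide across nodes, so assign fresh integers.
        pid = pid_map.setdefault(pkey, len(pid_map))
        tid = tid_map.setdefault(tkey, len(tid_map))
        events.append(
            {
                "name": rec["name"],
                "cat": rec["category"],
                "ph": "X",  # complete event: start timestamp plus duration
                "ts": rec["start_us"],
                "dur": rec["dur_us"],
                "pid": pid,
                "tid": tid,
            }
        )
    return {"traceEvents": events}


if __name__ == "__main__":
    sample = [
        '{"name": "fwd", "category": "COMPUTE", "start_us": 0, "dur_us": 50, "rank": 0, "pid": 100, "tid": 1}',
        '{"name": "all_reduce", "category": "COMM", "start_us": 50, "dur_us": 20, "rank": 1, "pid": 100, "tid": 1}',
    ]
    print(json.dumps(to_chrome_trace(sample), indent=2))
```

In the sample, both ranks report the same OS pid, which is plausible when they run on different nodes. Keying the map on the rank as well keeps their timelines on separate rows in the viewer.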

Logging and Monitoring Infrastructure

AReaL implements a hierarchical, color-coded logging system to help distinguish between different distributed components in a merged log file areal/utils/logging.py18-21

Component-Based Coloring

The LoggerColoredFormatter assigns specific colors to different system components based on exact name matches or prefix patterns areal/utils/logging.py159-184

Component Category	Color	Included Entities
Infrastructure	Blue	`LocalScheduler`, `RayScheduler`, `SlurmLauncher`, `Saver` areal/utils/logging.py40-46 areal/utils/logging.py91-93
Orchestration	White	`TrainController`, `RolloutController`, `SyncRPCServer`, `SGLangWrapper`, `VLLMWrapper` areal/utils/logging.py53-67 areal/infra/launcher/vllm_server.py31
RL Logic	Purple	`RLVRWorkflow`, `RewardAPI`, `ArealOpenAI`, `AgentGateway` areal/utils/logging.py48-51 areal/utils/logging.py105-117
Data/Metrics	Green	`StatsLogger`, `PerfTracer`, `Dataset`, `RLTrainer` areal/utils/logging.py57-75
Compute Backends	Cyan	`FSDPEngine`, `MegatronEngine`, `PPOActor`, `Platform` areal/utils/logging.py131-134 areal/utils/logging.py77 areal/utils/logging.py95-99

Sources: areal/utils/logging.py38-184 areal/infra/launcher/vllm_server.py31
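The exact-match-then-prefix lookup described above can be sketched with the standard logging module. This is an illustrative stand-in, not the LoggerColoredFormatter implementation: the class name ComponentColorFormatter, the abbreviated lookup tables, and the ANSI codes chosen for each color are assumptions.

```python
import logging

RESET = "\033[0m"
ANSI = {
    "blue": "\033[34m",
    "white": "\033[37m",
    "purple": "\033[35m",
    "green": "\033[32m",
    "cyan": "\033[36m",
}

# Abbreviated, hypothetical tables; the real ones live in areal/utils/logging.py.
EXACT = {"LocalScheduler": "blue", "TrainController": "white", "StatsLogger": "green"}
PREFIX = {"RLVRWorkflow": "purple", "FSDPEngine": "cyan"}


class ComponentColorFormatter(logging.Formatter):
    """Wrap each formatted record in the color assigned to its logger name."""

    def format(self, record):
        text = super().format(record)
        color = EXACT.get(record.name)  # exact name match wins
        if color is None:
            for prefix, candidate in PREFIX.items():
                if record.name.startswith(prefix):  # fall back to prefix match
                    color = candidate
                    break
        return f"{ANSI[color]}{text}{RESET}" if color else text


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(ComponentColorFormatter("%(name)s: %(message)s"))
    log = logging.getLogger("FSDPEngine.rank0")
    log.addHandler(handler)
    log.warning("gradient sync started")
```

When reading a merged log with a tool that does not render ANSI escapes, piping through a pager that does (for example less -R) keeps the component colors visible.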

Stats Logging and Tracking

The StatsLogger and StatsTracker provide the infrastructure for distributed metric aggregation areal/utils/logging.py57-58

Rank Filtering: To prevent log flooding, typically only the global rank 0 process initializes external connections (like WandB) and commits data.
Real-time Streaming: The StreamingFileHandler flushes after each log message to ensure logs are visible in real-time during a hang areal/utils/logging.py151-156

Sources: areal/utils/logging.py57-156
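Both ideas can be sketched in a few lines. The code below is a hedged illustration rather than AReaL's StatsLogger or StreamingFileHandler: it assumes the global rank is exposed through the RANK environment variable (as torchrun does), and the names is_global_rank_zero and FlushingFileHandler are hypothetical.

```python
import logging
import os


def is_global_rank_zero(env=None):
    """Return True when this process is global rank 0 (assumes a RANK env var)."""
    env = os.environ if env is None else env
    return int(env.get("RANK", "0")) == 0


class FlushingFileHandler(logging.FileHandler):
    """File handler that flushes explicitly after every record."""

    def emit(self, record):
        super().emit(record)
        # The stdlib stream handler already flushes on emit; the explicit call
        # documents the intent and still holds if the parent behavior changes.
        self.flush()


if __name__ == "__main__":
    logger = logging.getLogger("worker")
    logger.setLevel(logging.INFO)
    logger.addHandler(FlushingFileHandler("worker.log"))
    logger.info("entering all_reduce")  # on disk before any later hang
    if is_global_rank_zero():
        logger.info("rank 0: would commit metrics to the external tracker here")
```

Because each message reaches the file as soon as it is logged, the last line in a rank's log is a reliable indicator of the last step that rank completed before a hang.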

Debugging Workflow and Scheduler Interactions

The interaction between the Scheduler and the Worker processes is managed via the RolloutController, which orchestrates worker lifecycle and task dispatching areal/infra/controller/rollout_controller.py72-81

Process Lifecycle and Engine Initialization

The RolloutController initializes workers by defining a Job with specific SchedulingSpec requirements (CPU, GPU, Memory) areal/infra/controller/rollout_controller.py181-199

[Diagram: Distributed Startup and Handshake Sequence]

Sources: areal/infra/controller/rollout_controller.py152-204

Launcher-Specific Debugging

Each launcher implementation provides different levels of process control and monitoring:

LocalLauncher: Uses psutil to track job states and can terminate process trees via terminate_process_and_children areal/infra/launcher/local.py72-85. It maps OS process statuses to JobState areal/infra/launcher/local.py43-63
RayLauncher: Uses PlacementGroup to ensure resources are co-located areal/infra/launcher/ray.py186-204. Debugging involves checking the Ray dashboard and verifying the PlacementGroupSchedulingStrategy areal/infra/launcher/ray.py115-123
SlurmLauncher: Generates .sh sbatch scripts and monitors jobs via squeue (query_jobs) areal/infra/launcher/slurm.py79-83 areal/infra/launcher/slurm.py40

Sources: areal/infra/launcher/local.py43-85 areal/infra/launcher/ray.py115-204 areal/infra/launcher/slurm.py40-83
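As an illustration of the status-to-state mapping used for local jobs, the sketch below classifies OS process status strings into a small state enum. The JobState members and the mapping are assumptions chosen for the example, not the table in areal/infra/launcher/local.py; the status strings are the values psutil reports from Process.status(), written as literals so the example has no third-party dependency.

```python
from collections import Counter
from enum import Enum


class JobState(Enum):  # hypothetical states, for illustration only
    RUNNING = "running"
    SUSPENDED = "suspended"
    FINISHED = "finished"
    UNKNOWN = "unknown"


# Keys are status strings as reported by psutil.Process.status().
STATUS_TO_STATE = {
    "running": JobState.RUNNING,
    "sleeping": JobState.RUNNING,    # blocked in a wait, still alive
    "disk-sleep": JobState.RUNNING,  # uninterruptible I/O wait
    "stopped": JobState.SUSPENDED,
    "tracing-stop": JobState.SUSPENDED,
    "zombie": JobState.FINISHED,
    "dead": JobState.FINISHED,
}


def classify(status):
    """Map one OS status string to a coarse job state."""
    return STATUS_TO_STATE.get(status, JobState.UNKNOWN)


def summarize(statuses):
    """Count how many worker processes fall into each job state."""
    return Counter(classify(s) for s in statuses)


if __name__ == "__main__":
    print(summarize(["sleeping", "sleeping", "zombie", "running"]))
```

A summary with one finished process and the rest sleeping is a common signature of a hang: one worker exited and the survivors are waiting on it.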

Debugging Agent Workflows

Agentic RL workflows involving ArealOpenAI often use a ProxyRolloutServer to act as an OpenAI-compatible gateway areal/experimental/openai/proxy/proxy_rollout_server.py1-63

Persistent Inference Servers

When debugging agents, it is common to launch standalone inference servers to verify connectivity and generation quality. AReaL provides wrappers for this:

SGLangServerWrapper: Launches SGLang servers and polls the /v1/models endpoint to wait for readiness areal/infra/launcher/sglang_server.py65-88 areal/infra/launcher/sglang_server.py89-105
vLLMServerWrapper: Similar wrapper for vLLM, including graceful shutdown handlers for SIGTERM/SIGINT areal/infra/launcher/vllm_server.py108-124
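The readiness check can be reproduced by hand when verifying a standalone server. The following is a minimal sketch using only the standard library; the function names, timeouts, and the example address are hypothetical, and the wrappers' actual polling logic may differ.

```python
import http.client
import time
import urllib.error
import urllib.request


def models_endpoint_ready(base_url, timeout_s=2.0):
    """Return True if GET {base_url}/v1/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or non-2xx status: not ready yet.
        return False


def wait_until_ready(probe, timeout_s=60.0, interval_s=1.0):
    """Call probe() repeatedly until it returns True or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


if __name__ == "__main__":
    url = "http://127.0.0.1:30000"  # hypothetical server address
    ok = wait_until_ready(lambda: models_endpoint_ready(url), timeout_s=10)
    print("server ready" if ok else "server did not become ready in time")
```

Separating the probe from the retry loop keeps the loop reusable for other health endpoints and makes it easy to exercise without a running server.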

Data Flow: Natural Language to Code Entities

The following diagram maps the logical flow of an agent interaction through the proxy and inference engine.

[Diagram: Agent Interaction Data Flow]

Sources: areal/experimental/openai/proxy/proxy_rollout_server.py40-96 areal/infra/remote_inf_engine.py67-142 areal/infra/launcher/sglang_server.py65-105 areal/infra/launcher/vllm_server.py108-124

Best Practices for Troubleshooting

Check traces.jsonl: If performance is degraded, use PerfTracer to identify which category (COMM, COMPUTE, IO) is consuming the most time areal/utils/perf_tracer.py87-93
Interactive Timeline: Use areal/tools/plot_session_trace.py to generate Plotly-based HTML visualizations of session lifecycles areal/tools/plot_session_trace.py16-18
Process Group Warmup: On NPU/HCCL platforms, ensure warmup_process_groups is called to avoid lazy initialization races that lead to error code 7 areal/engine/core/distributed.py26-42
Network Probing: Use gethostip and find_free_ports to diagnose binding issues in multi-node setups areal/utils/network.py12-25 areal/utils/network.py114-134
Weight Sync Validation: Check RemoteInfEngine logs for successful weight updates via disk or NCCL areal/infra/remote_inf_engine.py177-214

Sources: areal/utils/perf_tracer.py87-93 areal/engine/core/distributed.py26-42 areal/utils/network.py12-134 areal/infra/remote_inf_engine.py177-214
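For the network probing item, a quick standalone check can confirm whether a port is actually bindable on a given interface. The sketch below uses only the standard socket module; pick_free_port and can_bind are hypothetical helpers and are not the gethostip or find_free_ports functions from areal/utils/network.py.

```python
import socket


def pick_free_port(host="127.0.0.1"):
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))
        return s.getsockname()[1]


def can_bind(host, port):
    """Return True if a TCP socket can currently bind to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:
            # Typically "address already in use" or an address not on this host.
            return False


if __name__ == "__main__":
    port = pick_free_port()
    print(f"free port: {port}, bindable: {can_bind('127.0.0.1', port)}")
```

A port returned by pick_free_port is only free at the moment of the call; another process can claim it before the caller binds, so a failed bind shortly afterwards points to a race rather than a misconfiguration.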


URL: https://deepwiki.com/inclusionAI/AReaL/16.3-debugging-distributed-training