
URL: https://deepwiki.com/inclusionAI/AReaL/13.3-trace-visualization

Last indexed: 7 May 2026 (2e12c1)

Trace Visualization

This page documents AReaL's trace visualization system, which provides performance profiling and session lifecycle tracking for distributed training and inference workloads. The system captures timestamped events during execution and exports them in formats compatible with visualization tools.

For distributed system architecture, see Architecture Overview. For performance optimization guidance, see Performance Optimization Guide.


Overview

AReaL provides two complementary tracing systems:

| System | Purpose | Output Format | Visualization Tool |
| --- | --- | --- | --- |
| PerfTracer | Low-level performance profiling with duration scopes and instant markers | JSONL → Chrome Trace JSON | Chrome Tracing (chrome://tracing) |
| SessionTracer | High-level session lifecycle tracking with phase breakdowns and metrics | JSONL | Plotly interactive plots |

Both systems write JSONL files to disk at configurable flush intervals, enabling post-training analysis of distributed execution patterns. The tracing infrastructure is designed for minimal overhead and uses context variables to propagate metadata through async execution contexts (areal/utils/perf_tracer.py:30-40).

Sources: areal/utils/perf_tracer.py:1-118


Performance Trace System

PerfTracer Architecture

The PerfTracer class captures fine-grained performance events across distributed workers. Each trace event follows the Chrome Trace Event Format with timestamp, duration, category, and metadata.
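To make the event shape concrete, the following is a minimal sketch of a Chrome Trace "complete" event serialized as one JSONL line. The helper name make_complete_event and the example argument values are assumptions for illustration, not part of AReaL; only the field names (name, cat, ph, ts, dur, pid, tid, args) come from the Chrome Trace Event Format.

```python
import json
import os
import threading
import time


def make_complete_event(name, category, start_s, end_s, args=None):
    """Build a Chrome Trace 'complete' event (ph="X") from timestamps in seconds."""
    return {
        "name": name,
        "cat": category,
        "ph": "X",                       # complete event: a start plus a duration
        "ts": start_s * 1e6,             # Chrome Trace timestamps are in microseconds
        "dur": (end_s - start_s) * 1e6,  # duration, also in microseconds
        "pid": os.getpid(),
        "tid": threading.get_ident(),
        "args": args or {},              # free-form metadata shown in the viewer
    }


start = time.perf_counter()
time.sleep(0.01)  # stand-in for real work
event = make_complete_event("forward", "compute", start, time.perf_counter(), {"rank": 0})
line = json.dumps(event)  # one JSON object per line yields a JSONL file
```

Instant markers use the same fields without dur and with ph set to "i".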

[Diagram: Trace Event Generation Flow]


Trace Event Categories:

The PerfTraceCategory enum (areal/utils/perf_tracer.py:62-93) defines standard categories used to organize events in visualization tools:

| Category | Usage | Examples |
| --- | --- | --- |
| COMPUTE | CPU/GPU computation | forward pass, backward pass, loss calculation |
| COMM | Distributed communication | all-reduce, broadcast, NCCL operations |
| IO | Disk I/O operations | checkpoint save/load, data loading |
| SYNC | Synchronization barriers | torch.cuda.synchronize(), barriers |
| SCHEDULER | Task scheduling | queue management, worker allocation |
| INSTR | Instrumentation overhead | tracing itself |
| MISC | Uncategorized events | general utilities |

Sources: areal/utils/perf_tracer.py:62-114, areal/utils/perf_tracer.py:1602-1950


Session Trace System

SessionTracer Architecture

The SessionTracer tracks high-level session lifecycle events for rollout episodes. Each session represents a single data sample's journey from submission through generation, reward computation, and finalization.

[Diagram: Session Lifecycle and Data Flow]


Session Lifecycle Events:

The SessionTraceEvent enum defines the lifecycle markers used by the tracer.

Sources: areal/utils/perf_tracer.py:244-293, areal/workflow/rlvr.py:82-136


SessionRecord Data Structure

The SessionRecord class represents a complete session execution trace with lifecycle timestamps, phase executions, and derived metrics.

Core Fields

| Field | Type | Description |
| --- | --- | --- |
| task_id | int | Dataset-level task identifier (areal/utils/perf_tracer.py:433) |
| session_id | int | Unique session identifier (areal/utils/perf_tracer.py:434) |
| rank | int | Process rank identifier (areal/utils/perf_tracer.py:435) |
| role | str \| None | Optional role (e.g., "rollout", "train") (areal/utils/perf_tracer.py:436) |
| submit_ts | float | Submission timestamp (wall-clock) (areal/utils/perf_tracer.py:437) |
| finalized_ts | float \| None | Finalization timestamp (areal/utils/perf_tracer.py:438) |
| status | str | One of "pending", "accepted", "rejected", "failed", "dropped" (areal/utils/perf_tracer.py:439) |
| phases | dict[str, list[PhaseSpan]] | Phase execution history (areal/utils/perf_tracer.py:441) |

Phase Tracking

Each phase (generate, reward, toolcall) can execute multiple times within a session. A PhaseSpan records each execution with start_ts and end_ts (areal/utils/perf_tracer.py:296-318).

Derived Metrics

The SessionRecord.to_dict() method computes derived metrics for analysis (areal/utils/perf_tracer.py:652-667):

  • total_s: Total session duration (submit to finalized).
  • generate_s: Total time spent in all generate phase executions.
  • reward_s: Total time spent in all reward phase executions.
  • toolcall_s: Total time spent in all tool call phase executions.
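The relationship between phase spans and the derived metrics can be sketched as follows. This is a simplified stand-in that reuses the class and field names from the tables above; the duration() and phase_total() helpers, and the choice to count an unfinished span as zero, are assumptions rather than the behavior of the real SessionRecord.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PhaseSpan:
    start_ts: float
    end_ts: float | None = None  # None while the phase is still running

    def duration(self) -> float:
        # Assumption: an unfinished span contributes nothing to the totals.
        return 0.0 if self.end_ts is None else self.end_ts - self.start_ts


@dataclass
class SessionRecord:
    task_id: int
    session_id: int
    submit_ts: float
    finalized_ts: float | None = None
    status: str = "pending"
    phases: dict[str, list[PhaseSpan]] = field(default_factory=dict)

    def phase_total(self, phase: str) -> float:
        # A phase may run several times; sum every recorded execution.
        return sum(span.duration() for span in self.phases.get(phase, []))

    def to_dict(self) -> dict:
        total = None if self.finalized_ts is None else self.finalized_ts - self.submit_ts
        return {
            "task_id": self.task_id,
            "session_id": self.session_id,
            "status": self.status,
            "total_s": total,
            "generate_s": self.phase_total("generate"),
            "reward_s": self.phase_total("reward"),
            "toolcall_s": self.phase_total("toolcall"),
        }


record = SessionRecord(task_id=0, session_id=7, submit_ts=100.0)
record.phases["generate"] = [PhaseSpan(100.5, 102.0), PhaseSpan(103.0, 104.0)]
record.phases["reward"] = [PhaseSpan(104.0, 104.25)]
record.finalized_ts, record.status = 105.0, "accepted"
metrics = record.to_dict()
```

In this example the two generate executions add up to 2.5 seconds of a 5-second session, so the remaining time is idle or spent in other phases.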

Sources: areal/utils/perf_tracer.py:296-668


Instrumentation APIs

Performance Trace Instrumentation

Function Decorator: Applied to methods to track execution time automatically.


Defined in areal/utils/perf_tracer.py:1602-1750
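As an illustration of the decorator pattern only, here is a self-contained sketch that times each call and appends a complete event to an in-memory list. The names trace_duration and EVENTS are hypothetical and do not reflect AReaL's actual decorator or its signature.

```python
import functools
import time

EVENTS = []  # in-memory stand-in for a tracer's event buffer


def trace_duration(name, category="misc"):
    """Hypothetical decorator: record how long each call to the wrapped function takes."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Recorded even when the function raises, so failures stay visible.
                EVENTS.append({
                    "name": name,
                    "cat": category,
                    "ph": "X",
                    "ts": start * 1e6,
                    "dur": (time.perf_counter() - start) * 1e6,
                })
        return wrapper
    return decorator


@trace_duration("compute_loss", category="compute")
def compute_loss(values):
    return sum(v * v for v in values) / len(values)


loss = compute_loss([1.0, 2.0, 3.0])
```

Recording in a finally block keeps the timing code out of the function body and ensures an event is emitted on both the success and the error path.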

Context Managers: Used for fine-grained scoping within functions.


Defined in areal/utils/perf_tracer.py:1752-1900

Instant Markers: Mark a point-in-time event without duration.


Defined in areal/utils/perf_tracer.py:1902-1950
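The scoped and instant styles described above can be sketched together. The functions scope and instant below are hypothetical stand-ins written for this example; they show the general technique (a context manager that emits a duration event on exit, and a function that emits a zero-duration marker), not AReaL's API.

```python
import contextlib
import time

EVENTS = []  # in-memory stand-in for a tracer's event buffer


@contextlib.contextmanager
def scope(name, category="misc", args=None):
    """Hypothetical scope: emit one complete event covering the with-block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        EVENTS.append({
            "name": name, "cat": category, "ph": "X",
            "ts": start * 1e6,
            "dur": (time.perf_counter() - start) * 1e6,
            "args": args or {},
        })


def instant(name, category="misc", args=None):
    """Hypothetical marker: emit a point-in-time event with no duration."""
    EVENTS.append({
        "name": name, "cat": category,
        "ph": "i", "s": "t",  # instant event, scoped to the emitting thread
        "ts": time.perf_counter() * 1e6,
        "args": args or {},
    })


with scope("train_step", "compute", {"step": 3}):
    with scope("all_reduce", "comm"):
        pass  # stand-in for a collective operation
    instant("checkpoint_requested", "io")
```

Because a scope is recorded when it exits, the inner all_reduce event is appended before the enclosing train_step event; trace viewers order events by ts, so the append order does not affect the rendered timeline.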

Sources: areal/utils/perf_tracer.py:1600-1950

Session Trace Instrumentation

Workflow Usage Example: The session tracing pattern is implemented in RLVRWorkflow (areal/workflow/rlvr.py:82-136) and VisionRLVRWorkflow (areal/workflow/vision_rlvr.py:45-101).


Key Decorators:

| Decorator | Purpose | Usage |
| --- | --- | --- |
| @session_context() | Registers a new session and propagates session_id | Applied to top-level workflow methods (areal/utils/perf_tracer.py:816-860) |
| @trace_session(phase) | Wraps a method with phase start/end markers | Applied to phase-specific methods (areal/utils/perf_tracer.py:862-895) |
| atrace_session_phase(phase) | Async context manager for phase tracking | Used within session context (areal/utils/perf_tracer.py:897-918) |

Sources: areal/workflow/rlvr.py:82-136, areal/workflow/vision_rlvr.py:45-101, areal/utils/perf_tracer.py:816-918


Converting to Chrome Trace Format

perf_trace_converter.py

The convert_jsonl_to_chrome_trace() function (areal/tools/perf_trace_converter.py:295-438) transforms per-rank JSONL files into a single Chrome Trace JSON file.

[Diagram: Trace Conversion Logic]


Key Transformations:

  1. PID/TID Remapping: Original process/thread IDs are replaced with sequential integers grouped by (rank, role, original_pid) (areal/tools/perf_trace_converter.py:121-215).
  2. Metadata Generation: Process and thread names are synthesized from rank/role information (areal/tools/perf_trace_converter.py:217-230).
  3. Event Sorting: Ranks are sorted numerically and roles alphabetically to ensure consistent visualization ordering (areal/tools/perf_trace_converter.py:69-118).
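The three transformations above can be sketched in a few lines. This is a simplified illustration, not the converter's code: the function name convert_events, the assumption that each input event carries rank and role fields, and the generated process names are all hypothetical, and thread ID remapping is omitted for brevity.

```python
import json


def convert_events(events):
    """Merge per-rank events into one Chrome Trace document with remapped process IDs."""
    # Sorting the tuples orders by rank numerically, then role alphabetically,
    # then original pid, so the remapped IDs are stable across runs.
    keys = sorted({(e["rank"], e["role"], e["pid"]) for e in events})
    pid_map = {key: new_pid for new_pid, key in enumerate(keys)}

    trace_events = []
    for (rank, role, _), new_pid in pid_map.items():
        # "M" metadata events give each remapped process a readable name.
        trace_events.append({
            "name": "process_name", "ph": "M", "pid": new_pid,
            "args": {"name": f"rank {rank} [{role}]"},
        })
    for e in events:
        out = {k: v for k, v in e.items() if k not in ("rank", "role")}
        out["pid"] = pid_map[(e["rank"], e["role"], e["pid"])]
        trace_events.append(out)
    return {"traceEvents": trace_events}


raw = [
    {"name": "generate", "ph": "X", "ts": 0, "dur": 50,
     "pid": 9001, "tid": 1, "rank": 1, "role": "rollout"},
    {"name": "forward", "ph": "X", "ts": 10, "dur": 20,
     "pid": 4242, "tid": 1, "rank": 0, "role": "train"},
]
trace = convert_events(raw)
chrome_json = json.dumps(trace)  # loadable in chrome://tracing
```

Remapping to small sequential IDs keeps rows from different hosts from colliding when two machines happen to reuse the same operating-system PID.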

Sources: areal/tools/perf_trace_converter.py:69-438


Visualizing Session Traces

plot_session_trace.py

The plot_session_trace.py script generates interactive Plotly visualizations from session trace JSONL files.

[Diagram: Session Visualization Pipeline]


Generated Visualizations:

  1. Overall Distribution: Histograms of duration metrics (total_s, generate_s, reward_s, toolcall_s) (areal/tools/plot_session_trace.py:24-29).
  2. Lifecycle Timeline: Gantt chart showing session execution phases. Each session bar shows idle periods (gray), generate phases (blue), reward phases (red), and tool call phases (orange) (areal/tools/plot_session_trace.py:31-42).
  3. Step Detection Algorithm: The _determine_step_timepoints() function implements global step synchronization by partitioning sessions into steps based on completion times and batch sizes (areal/tools/plot_session_trace.py:154-215).
  4. Interactive Controls: Users can filter by global step using dropdowns or zoom into specific time ranges to inspect concurrent session execution (areal/tools/plot_session_trace.py:217-250).
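One plausible reading of the step detection described in item 3 is sketched below: sort sessions by completion time and treat every batch_size-th completion as the end of a step. The function names, the handling of leftover sessions, and the rule itself are assumptions; the actual _determine_step_timepoints() may work differently.

```python
import bisect


def step_boundaries(finalized_ts, batch_size):
    """Return the completion time that closes each full batch of `batch_size` sessions."""
    ordered = sorted(finalized_ts)
    return [ordered[i] for i in range(batch_size - 1, len(ordered), batch_size)]


def assign_steps(finalized_ts, batch_size):
    """Map each session (by input position) to the step in which it completed."""
    bounds = step_boundaries(finalized_ts, batch_size)
    steps = []
    for ts in finalized_ts:
        # The first boundary at or after this completion time identifies the step.
        idx = bisect.bisect_left(bounds, ts)
        # Sessions finishing after the last full batch belong to no completed step.
        steps.append(idx if idx < len(bounds) else None)
    return steps


finalized = [3.0, 1.0, 2.0, 7.0, 5.0, 9.0]
bounds = step_boundaries(finalized, batch_size=2)
steps = assign_steps(finalized, batch_size=2)
```

With a batch size of 2, the sorted completion times 1, 2, 3, 5, 7, 9 yield boundaries at 2, 5, and 9, so the session that finished at 3.0 falls into step 1.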

Sources: areal/tools/plot_session_trace.py:20-60, areal/tools/plot_session_trace.py:154-250


Configuration

Tracer Configuration

The tracing behavior is controlled via configuration dataclasses (areal/api/cli_args.py:25):

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enable | bool | False | Enable performance tracing |
| save_interval | int | 100 | Flush events to disk every N events (areal/utils/perf_tracer.py:153-154) |
| flush_threshold | int | 128 | Flush completed sessions every N sessions (areal/utils/perf_tracer.py:157-166) |

Traces are written to a standardized path under fileroot/logs/user/experiment/trial/ with rank-qualified filenames (areal/utils/perf_tracer.py:199-230).
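The effect of a flush interval can be illustrated with a small buffered JSONL writer. The class BufferedJsonlWriter and the file name traces-r0.jsonl are made up for this sketch, and the directory components are the literal placeholders from the path above; only the idea of flushing every save_interval events is taken from the table.

```python
import json
import os
import tempfile


class BufferedJsonlWriter:
    """Hypothetical writer: buffer events and append them to disk every N events."""

    def __init__(self, path, save_interval=100):
        self.path = path
        self.save_interval = save_interval
        self._buffer = []

    def add(self, event):
        self._buffer.append(event)
        if len(self._buffer) >= self.save_interval:
            self.flush()

    def flush(self):
        if not self._buffer:
            return
        os.makedirs(os.path.dirname(self.path), exist_ok=True)
        with open(self.path, "a", encoding="utf-8") as f:
            for event in self._buffer:
                f.write(json.dumps(event) + "\n")
        self._buffer.clear()


def count_lines(path):
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


root = tempfile.mkdtemp()
# Directory layout mirrors the text; the rank-qualified file name is invented.
path = os.path.join(root, "logs", "user", "experiment", "trial", "traces-r0.jsonl")
writer = BufferedJsonlWriter(path, save_interval=3)
for i in range(7):
    writer.add({"name": f"event_{i}"})
lines_before_final_flush = count_lines(path)
writer.flush()  # write the remaining buffered event at shutdown
lines_after_final_flush = count_lines(path)
```

A larger interval means fewer writes during training but more events lost if the process dies before the next flush, which is why a final flush at shutdown matters.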

Sources: areal/utils/perf_tracer.py:153-230, areal/api/cli_args.py:25


Context Variable Propagation

The tracing system uses Python context variables (contextvars) to propagate metadata through async execution.


These variables are automatically inherited by async tasks created within a traced context, enabling task-level correlation of events and global step annotation on performance events.
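The inheritance behavior comes from the standard library: asyncio copies the current context whenever it creates a task. The sketch below demonstrates this with two context variables whose names (task_id, global_step) are chosen to match the description and are not necessarily the names used in perf_tracer.py.

```python
import asyncio
import contextvars

task_id = contextvars.ContextVar("task_id", default=None)
global_step = contextvars.ContextVar("global_step", default=None)


def emit(name):
    """Build an event annotated with whatever metadata the current context holds."""
    return {"name": name, "task_id": task_id.get(), "global_step": global_step.get()}


async def child(name):
    await asyncio.sleep(0)
    return emit(name)


async def traced_task(tid):
    task_id.set(tid)  # visible in this task and in tasks it spawns afterwards
    return await asyncio.create_task(child(f"child-of-{tid}"))


async def main():
    global_step.set(5)  # set once; every task created below inherits it
    events = await asyncio.gather(traced_task(1), traced_task(2))
    return events, task_id.get()


events, outer_task_id = asyncio.run(main())
```

Each child sees the task_id set by its parent and the shared global_step, while the outer coroutine's task_id stays unset because a task's changes never leak back into the context it was copied from.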

Sources: areal/utils/perf_tracer.py:31-40