
URL: https://deepwiki.com/inclusionAI/AReaL/13-performance-and-monitoring

Last indexed: 7 May 2026 (2e12c1)

Performance and Monitoring

This document describes AReaL's performance monitoring and tracing infrastructure. The system provides two complementary tracing mechanisms: PerfTracer for low-level performance events (compute, communication, I/O) and SessionTracer for high-level session lifecycle tracking (submission, generation, reward computation). These systems enable detailed performance analysis, bottleneck identification, and optimization of distributed RL training workflows.

For algorithm-specific training metrics and logging, see Trainer Orchestration. For memory optimization techniques, see Memory Management.


Tracing System Architecture

AReaL's tracing infrastructure consists of three layers: instrumentation (decorators and context managers), data collection (tracers writing JSONL files), and visualization (conversion and plotting tools).

[Diagram: Performance Tracing Data Flow]


Sources: areal/utils/perf_tracer.py:31-138, areal/tools/perf_trace_converter.py:121-152, areal/tools/plot_session_trace.py:20-42


PerfTracer: Low-Level Performance Tracing

The PerfTracer class records fine-grained performance events with microsecond timestamps, process/thread identifiers, and categorical tags. In distributed training, each rank writes its events to its own JSONL file; the rank-qualified file name is produced by _rank_qualified_filename (areal/utils/perf_tracer.py:142-144).

Trace Event Categories

Performance events are classified into categories defined in PerfTraceCategory (areal/utils/perf_tracer.py:62-93):

| Category | Enum Value | Description | Typical Usage |
|---|---|---|---|
| COMPUTE | compute | CPU/GPU computation | Forward pass, backward pass, loss calculation |
| COMM | comm | Distributed communication | All-reduce, broadcast, NCCL operations |
| IO | io | Disk I/O operations | Checkpoint save/load, dataset loading |
| SYNC | sync | Synchronization primitives | Barriers, locks, wait operations |
| SCHEDULER | scheduler | Task scheduling | Queue management, worker dispatch |
| INSTR | instr | Instrumentation overhead | Profiler hooks, measurement code |
| MISC | misc | Uncategorized events | Default fallback category |

Sources: areal/utils/perf_tracer.py:62-114
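To make the event model concrete, here is a minimal sketch of a category enum and a per-rank JSONL writer. It is not AReaL's implementation: the enum values are taken from the table above, but the class name Category, the helpers rank_qualified_filename, make_event, and append_event, the "-r<rank>" file suffix, and the Chrome-Trace-style field names are all assumptions made for illustration.

```python
import json
import os
import threading
from enum import Enum
from pathlib import Path


class Category(str, Enum):
    # Values mirror the "Enum Value" column of the category table.
    COMPUTE = "compute"
    COMM = "comm"
    IO = "io"
    SYNC = "sync"
    SCHEDULER = "scheduler"
    INSTR = "instr"
    MISC = "misc"


def rank_qualified_filename(path, rank):
    """Hypothetical naming scheme: 'trace.jsonl' becomes 'trace-r3.jsonl' for rank 3."""
    p = Path(path)
    return str(p.with_name(f"{p.stem}-r{rank}{p.suffix}"))


def make_event(name, category, start_us, dur_us):
    """Build one complete-duration event ("ph": "X") with assumed field names."""
    return {
        "name": name,
        "cat": category.value,
        "ph": "X",
        "ts": start_us,   # start time in microseconds
        "dur": dur_us,    # duration in microseconds
        "pid": os.getpid(),
        "tid": threading.get_ident(),
    }


def append_event(path, event):
    """Append one event as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")
```

Writing one JSON object per line keeps each rank's file append-only, so a crashed run still leaves a readable prefix of events.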

Instrumentation API

The instrumentation API provides multiple ways to hook into the codebase:

  • Decorators: @trace_perf for timing methods and functions.
  • Context Managers: trace_scope (sync) and atrace_scope (async) for block-level timing.
  • Instant Markers: mark_instant for point-in-time events.

Sources: areal/utils/perf_tracer.py:121-140, areal/utils/perf_tracer.py:199-230
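The three instrumentation styles listed above can be illustrated with a small standalone sketch. The names scope, ascope, traced, and instant are hypothetical stand-ins for trace_scope, atrace_scope, @trace_perf, and mark_instant; their signatures and the in-memory EVENTS list are assumptions, not AReaL's actual API.

```python
import asyncio
import contextlib
import functools
import time

EVENTS = []  # in-memory sink for this sketch only


def _now_us():
    """Monotonic timestamp in microseconds."""
    return time.perf_counter_ns() // 1_000


def _record(name, category, start_us, end_us):
    EVENTS.append({"name": name, "cat": category, "ph": "X",
                   "ts": start_us, "dur": end_us - start_us})


@contextlib.contextmanager
def scope(name, category="misc"):
    """Time a synchronous block; the event is recorded even if the block raises."""
    start = _now_us()
    try:
        yield
    finally:
        _record(name, category, start, _now_us())


@contextlib.asynccontextmanager
async def ascope(name, category="misc"):
    """Async counterpart of scope(), usable with 'async with'."""
    start = _now_us()
    try:
        yield
    finally:
        _record(name, category, start, _now_us())


def traced(name=None, category="misc"):
    """Decorator that wraps a function call in scope()."""
    def decorator(fn):
        label = name or fn.__name__

        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            with scope(label, category):
                return fn(*args, **kwargs)
        return wrapper
    return decorator


def instant(name, category="misc"):
    """Record a point-in-time marker (no duration)."""
    EVENTS.append({"name": name, "cat": category, "ph": "i", "ts": _now_us()})


@traced(category="compute")
def square(x):
    return x * x


async def fetch():
    async with ascope("fetch", "io"):
        await asyncio.sleep(0)
    return "done"
```

Building the decorator on top of the context manager keeps the timing logic in one place, so both styles produce identical event records.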


SessionTracer: Lifecycle Event Tracking

The SessionTracer class tracks the complete lifecycle of individual rollout sessions from submission through finalization, recording timestamps for each phase (generate, reward, tool calls) and computing derived performance metrics.

Session Lifecycle Model

Sessions transition through various states tracked by SessionTracer:

  • pending: Registered but not yet executing.
  • accepted: Execution has started.
  • ready: Successfully finalized with all phases complete.
  • rejected/dropped/failed: Terminal error states.

Sources: areal/utils/perf_tracer.py:246-295
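The lifecycle states above can be read as a small state machine. The sketch below is only an illustration: the state names come from the list above, but the Status and Session classes and, in particular, the ALLOWED transition table are assumptions, since the source does not spell out which transitions SessionTracer permits.

```python
from enum import Enum


class Status(str, Enum):
    PENDING = "pending"
    ACCEPTED = "accepted"
    READY = "ready"
    REJECTED = "rejected"
    DROPPED = "dropped"
    FAILED = "failed"


# "ready" is the successful terminal state; the other three are error states.
TERMINAL = {Status.READY, Status.REJECTED, Status.DROPPED, Status.FAILED}

# Assumed transition table, chosen for illustration only.
ALLOWED = {
    Status.PENDING: {Status.ACCEPTED, Status.REJECTED, Status.DROPPED},
    Status.ACCEPTED: {Status.READY, Status.FAILED, Status.DROPPED},
}


class Session:
    """Hypothetical per-session record that enforces the assumed transitions."""

    def __init__(self, session_id):
        self.session_id = session_id
        self.status = Status.PENDING
        self.history = [Status.PENDING]

    def transition(self, new_status):
        if new_status not in ALLOWED.get(self.status, set()):
            raise ValueError(
                f"illegal transition {self.status.value} -> {new_status.value}"
            )
        self.status = new_status
        self.history.append(new_status)

    @property
    def is_terminal(self):
        return self.status in TERMINAL
```

Rejecting transitions out of a terminal state makes double-finalization bugs surface immediately instead of silently corrupting the trace.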

Phase Tracking

Phases represent distinct execution stages within a session. Each phase can execute multiple times (e.g., multiple generate calls in multi-turn workflows). RLVRWorkflow and VisionRLVRWorkflow use these to track:

  • generate: Model inference time.
  • reward: Reward function execution time.
  • toolcall: Time spent in external tool execution.

Sources: areal/tools/plot_session_trace.py:30-42
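Because a phase can run several times within one session, each phase is best thought of as a list of time spans rather than a single interval. The following sketch shows that idea with a hypothetical PhaseRecorder class; the class, its method names, and the injectable clock are assumptions for illustration and do not reflect SessionTracer's real interface.

```python
import contextlib
import time
from collections import defaultdict


class PhaseRecorder:
    """Collects (start, end) spans per phase name for one session."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock              # injectable so tests can be deterministic
        self.phases = defaultdict(list)  # phase name -> list of (start, end)

    @contextlib.contextmanager
    def phase(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Append rather than overwrite: multi-turn workflows enter
            # the same phase more than once.
            self.phases[name].append((start, self._clock()))

    def total(self, name):
        """Cumulative seconds spent in all executions of a phase."""
        return sum(end - start for start, end in self.phases.get(name, []))
```

A multi-turn workflow would wrap every model call in `with recorder.phase("generate")`, and `recorder.total("generate")` then yields the cumulative generation time across turns.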


Stats Logging and Monitoring

The StatsLogger class (areal/utils/stats_logger.py:23-33) provides high-level metric logging to external providers. It supports multiple backends, including Weights & Biases (WandB), SwanLab, TensorBoard, and Trackio.

[Diagram: StatsLogger Infrastructure]


Sources: areal/utils/stats_logger.py:37-163
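One common way to support several logging backends is a fan-out wrapper that forwards each metrics dictionary to every enabled backend. The sketch below shows that pattern only; Backend, MemoryBackend, FanOutLogger, the commit method, and the is_main_process flag are hypothetical names, and the real StatsLogger may be organized differently.

```python
from typing import Dict, Protocol


class Backend(Protocol):
    """Anything with a log(metrics, step) method can act as a backend."""

    def log(self, metrics: Dict[str, float], step: int) -> None: ...


class MemoryBackend:
    """Stand-in for a real provider; it just stores what it receives."""

    def __init__(self, name):
        self.name = name
        self.records = []

    def log(self, metrics, step):
        self.records.append((step, dict(metrics)))


class FanOutLogger:
    """Forwards every commit to all configured backends."""

    def __init__(self, backends, is_main_process=True):
        # Assumption: only one process per job should talk to external
        # providers, so other processes get an empty backend list.
        self._backends = list(backends) if is_main_process else []

    def commit(self, step, metrics):
        for backend in self._backends:
            backend.log(metrics, step)
```

With this shape, adding a provider means writing one small adapter class with a `log` method, and the training loop itself never changes.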

Configuration and Metadata

Logging is configured via StatsLoggerConfig (areal/api/cli_args.py:15). The logger automatically captures version_info, including the git commit ID, branch name, and dirty status, to ensure experiment traceability (areal/utils/stats_logger.py:56-61).


Trace Visualization Tools

Perf Trace Converter

The perf_trace_converter.py tool converts per-rank JSONL files into a single Chrome Trace JSON file. It uses _remap_process_and_thread_ids (areal/tools/perf_trace_converter.py:121-131) to ensure that processes and threads from different ranks are displayed distinctly in tools like Perfetto or chrome://tracing.
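The need for ID remapping comes from the fact that different ranks, often on different hosts, can reuse the same OS process and thread IDs, which would make their events collapse onto one track in the viewer. The sketch below shows one way to merge per-rank files and assign fresh IDs; load_jsonl, remap_ids, and convert are hypothetical helpers, and the remapping scheme is an assumption rather than the logic of _remap_process_and_thread_ids.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read one event per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def remap_ids(events_by_rank):
    """Assign a unique pid per (rank, pid) and a unique tid per (rank, pid, tid)."""
    pid_map, tid_map, merged = {}, {}, []
    for rank in sorted(events_by_rank):
        for ev in events_by_rank[rank]:
            pkey = (rank, ev.get("pid", 0))
            tkey = pkey + (ev.get("tid", 0),)
            new_pid = pid_map.setdefault(pkey, len(pid_map) + 1)
            new_tid = tid_map.setdefault(tkey, len(tid_map) + 1)
            # Copy the event so the input data is left untouched.
            merged.append({**ev, "pid": new_pid, "tid": new_tid})
    return merged


def convert(rank_files, output_path):
    """rank_files maps rank -> JSONL path; writes a Chrome Trace JSON object."""
    events_by_rank = {rank: load_jsonl(p) for rank, p in rank_files.items()}
    trace = {"traceEvents": remap_ids(events_by_rank), "displayTimeUnit": "ms"}
    Path(output_path).write_text(json.dumps(trace), encoding="utf-8")
    return trace
```

The resulting file uses the JSON object form of the Chrome Trace Event Format, with the event list under the "traceEvents" key, which both Perfetto and chrome://tracing can open.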

Session Trace Plotter

The plot_session_trace.py tool generates interactive Plotly reports. It flattens the nested phases data (areal/tools/plot_session_trace.py:216-217) to visualize the timeline of session execution across distributed actors.

| Metric | Column | Description |
|---|---|---|
| Total Duration | total_s | Time from submission to finalization (areal/tools/plot_session_trace.py:25) |
| Generation | generate_s | Cumulative time in generation phases (areal/tools/plot_session_trace.py:26) |
| Reward | reward_s | Time spent in reward functions (areal/tools/plot_session_trace.py:27) |
| Tool Call | toolcall_s | Time spent in external tool execution (areal/tools/plot_session_trace.py:28) |
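The flattening step can be sketched as turning each nested session record into one flat row with the columns from the table above. The column names total_s, generate_s, reward_s, and toolcall_s come from the source; the input field names (session_id, submit_ts, finalize_ts, phases, start_ts, end_ts) are assumptions about the record layout, not the actual trace schema.

```python
PHASES = ("generate", "reward", "toolcall")


def flatten_session(record):
    """Convert one nested session record into a flat row of durations in seconds."""
    row = {
        "session_id": record["session_id"],
        "total_s": record["finalize_ts"] - record["submit_ts"],
    }
    phases = record.get("phases", {})
    for name in PHASES:
        spans = phases.get(name, [])
        # Sum over all executions, since a phase may run several times.
        row[f"{name}_s"] = sum(s["end_ts"] - s["start_ts"] for s in spans)
    return row


def flatten_all(records):
    """Flatten a list of session records into a list of rows."""
    return [flatten_session(r) for r in records]
```

Rows in this shape can be passed directly to a dataframe or plotting library, with one row per session and one numeric column per metric.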

Detailed Documentation

  • Performance Tracing: Detailed guide on the PerfTracer utility for method-level performance tracking.
  • Session Tracing: Detailed guide on SessionTracer for rollout session lifecycle tracking.
  • Trace Visualization: Guide for using the perf_trace_converter and plot_session_trace tools.
  • Memory Optimizations: Techniques for reducing memory: gradient checkpointing, offload, per-layer optim.
  • Metrics Tracking: StatsLogger and StatsTracker systems for distributed metric aggregation and multi-backend logging.