Last indexed: 7 May 2026 (2e12c1)

Performance Optimization Guide

This guide covers best practices for maximizing training throughput and minimizing latency in AReaL. It focuses on practical configuration choices, parallelism tuning, and monitoring techniques to achieve optimal performance.

For memory-related issues (OOM errors), see 16.2 Memory and OOM Issues. For debugging distributed training problems, see 16.3 Debugging Distributed Training. For specific algorithm configurations, see 2.8 Algorithm-Specific Configurations.

Performance Monitoring and Profiling

AReaL provides comprehensive performance tracing infrastructure to identify bottlenecks and measure optimization impact.

PerfTracer for Method-Level Timing

The PerfTracer class records timestamped trace events for individual methods and operations, outputting Chrome Trace Format (Perfetto-compatible) visualizations. It categorizes events into functional buckets to simplify analysis areal/utils/perf_tracer.py62-85

Enabling PerfTracer:

Configure in your experiment YAML or CLI using PerfTracerConfig areal/api/cli_args.py25:

Key traced categories (PerfTraceCategory):

COMPUTE: CPU/GPU computation like forward/backward passes areal/utils/perf_tracer.py87
COMM: Distributed communication (all-reduce, broadcast) areal/utils/perf_tracer.py88
IO: Disk operations (checkpointing, data loading) areal/utils/perf_tracer.py89
SCHEDULER: Task scheduling and queue management areal/utils/perf_tracer.py91

Viewing traces:

Convert JSONL traces to Chrome format using the perf_trace_converter tool. This tool handles remapping process and thread IDs to ensure uniqueness across distributed ranks areal/tools/perf_trace_converter.py121-132
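The conversion step can be pictured with a small, self-contained sketch. It is not the actual perf_trace_converter code: the JSONL field names (rank, pid, tid, ts_us, dur_us) and the function jsonl_to_chrome_trace are assumptions made for illustration. Only the output keys (traceEvents, ph, ts, dur, pid, tid) follow the Chrome Trace Event Format.

```python
import json


def jsonl_to_chrome_trace(lines):
    """Convert per-rank JSONL event records into one Chrome Trace Format dict.

    The input schema here is hypothetical; adapt the field names to the
    records your tracer actually writes.
    """
    pid_map = {}  # (rank, original pid) -> globally unique pid
    events = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        # Ranks on different hosts can reuse the same OS pid, so remap
        # each (rank, pid) pair to its own id.
        new_pid = pid_map.setdefault((rec["rank"], rec["pid"]), len(pid_map))
        events.append({
            "name": rec["name"],
            "cat": rec["cat"],
            "ph": "X",           # complete event: start timestamp plus duration
            "ts": rec["ts_us"],  # microseconds
            "dur": rec["dur_us"],
            "pid": new_pid,
            "tid": rec["tid"],
        })
    return {"traceEvents": events, "displayTimeUnit": "ms"}


sample = [
    json.dumps({"rank": 0, "pid": 4242, "tid": 1, "name": "forward",
                "cat": "compute", "ts_us": 0, "dur_us": 1500}),
    json.dumps({"rank": 1, "pid": 4242, "tid": 1, "name": "all_reduce",
                "cat": "comm", "ts_us": 200, "dur_us": 900}),
]
trace = jsonl_to_chrome_trace(sample)
```

Both sample events share the OS pid 4242 but come from different ranks, so they end up in separate process rows when the resulting JSON is loaded in Perfetto or chrome://tracing.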

Sources: areal/utils/perf_tracer.py62-114 areal/tools/perf_trace_converter.py121-215

SessionTracer for Rollout Lifecycle Tracking

SessionTracer tracks individual rollout sessions from submission through finalization, recording phase-level breakdowns such as "generate", "reward", and "toolcall" areal/tools/plot_session_trace.py24-30

Enabling SessionTracer:

Configure using SessionTracerConfig areal/api/cli_args.py25:

Tracked phases in Workflows:

Workflows like RLVRWorkflow and VisionRLVRWorkflow use decorators and context managers to mark these phases:

@trace_session("reward"): Tracks reward computation duration areal/workflow/rlvr.py82 areal/workflow/vision_rlvr.py45
async with atrace_session_phase("generate"): Tracks inference engine calls areal/workflow/rlvr.py129 areal/workflow/vision_rlvr.py94
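The decorator and context-manager pattern described above can be sketched with the standard library alone. The names timed_phase, timed, and phase_totals are hypothetical stand-ins, not AReaL's trace_session or atrace_session_phase, and the sleeps stand in for real inference and reward work.

```python
import asyncio
import time
from collections import defaultdict
from contextlib import asynccontextmanager
from functools import wraps

phase_totals = defaultdict(float)  # phase name -> accumulated seconds


@asynccontextmanager
async def timed_phase(name):
    """Accumulate wall-clock time spent inside the block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        phase_totals[name] += time.perf_counter() - start


def timed(name):
    """Decorator form: time an entire coroutine as one phase."""
    def decorator(fn):
        @wraps(fn)
        async def wrapper(*args, **kwargs):
            async with timed_phase(name):
                return await fn(*args, **kwargs)
        return wrapper
    return decorator


@timed("reward")
async def compute_reward(completion):
    await asyncio.sleep(0.01)  # stands in for reward computation
    return float(len(completion) > 0)


async def run_episode(prompt):
    async with timed_phase("generate"):
        await asyncio.sleep(0.02)  # stands in for an inference engine call
        completion = prompt + " -> answer"
    return await compute_reward(completion)


reward = asyncio.run(run_episode("2+2"))
```

After the run, phase_totals holds one entry per phase, which is the kind of per-phase breakdown a session report can aggregate into latency histograms.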

Visualizing session traces:

The plot_session_trace tool (areal/tools/plot_session_trace.py) generates an interactive Plotly-based HTML report with:

Timeline View: Visualizes overlapping rollout sessions and their internal phases areal/tools/plot_session_trace.py196-214
Latency Analysis: Histograms of total_s, generate_s, and reward_s areal/tools/plot_session_trace.py55-60

Trace Data Flow Diagram

[Diagram: Performance Data Architecture, showing how performance data flows from the code entities to the visualization tools.]

Sources: areal/utils/perf_tracer.py117-118 areal/workflow/rlvr.py82-130 areal/tools/plot_session_trace.py154-169 areal/tools/perf_trace_converter.py14-27

FSDP Optimization Techniques

When using the FSDPEngine, several optimizations are available to reduce communication overhead and improve step time.

Per-Layer Optimizer Step

The PerLayerOptimWrapper streams optimizer states to the device one layer at a time. This is particularly effective when combined with CPU parameter offloading, because it avoids the high latency of running the entire optimizer step on the CPU areal/engine/fsdp_utils/optimizer.py19-20

Configuration:

Implementation: The wrapper groups model parameters by layer tests/test_per_layer_optim_step.py130-132. During the step() call, it moves the required optimizer states (like exp_avg and exp_avg_sq) to the GPU for one layer at a time, performs the update, and then moves them back to the CPU areal/engine/fsdp_utils/optimizer.py144-153. This minimizes peak GPU memory while maintaining high throughput docs/en/best_practices/handling_oom.md170-181.
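To see why streaming per layer bounds peak GPU memory, the following is a pure-Python accounting sketch rather than the wrapper's code. The helper names, the "layers.N." naming pattern, and the element counts are all assumptions; a real implementation would move tensors between devices instead of summing sizes.

```python
import re
from collections import OrderedDict


def group_by_layer(param_names):
    """Group parameter names by layer index; everything else goes to 'misc'."""
    groups = OrderedDict()
    for name in param_names:
        match = re.search(r"layers\.(\d+)\.", name)
        key = f"layer_{match.group(1)}" if match else "misc"
        groups.setdefault(key, []).append(name)
    return groups


def peak_resident_state(groups, state_numel):
    """Peak optimizer-state elements on the device when groups are streamed.

    Each group is "moved in", updated, and "moved out" before the next one,
    so only one group's state is resident at a time.
    """
    peak = 0
    for names in groups.values():
        resident = sum(state_numel[n] for n in names)
        peak = max(peak, resident)
    return peak


param_numel = {
    "embed.weight": 50,
    "layers.0.attn.weight": 40,
    "layers.0.mlp.weight": 60,
    "layers.1.attn.weight": 40,
    "layers.1.mlp.weight": 60,
}
# Adam keeps two buffers per parameter (exp_avg and exp_avg_sq).
state_numel = {name: 2 * n for name, n in param_numel.items()}

groups = group_by_layer(param_numel)
total_state = sum(state_numel.values())
peak_state = peak_resident_state(groups, state_numel)
```

In this toy model the full optimizer state is 500 elements, while streaming keeps at most 200 on the device, the size of the largest group.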

AnyPrecision Optimization

For training in BFloat16 without sacrificing stability, the AnyPrecisionAdamW optimizer supports Kahan summation to track accumulated rounding errors in high precision areal/engine/fsdp_utils/optimizer.py44-52. It allows direct control over momentum_dtype and variance_dtype areal/engine/fsdp_utils/optimizer.py53-55.
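The effect of Kahan summation on low-precision weights can be reproduced without any ML framework. The sketch below emulates bfloat16 rounding with the standard library and compares naive and compensated accumulation. It illustrates the general technique, not the AnyPrecisionAdamW source; to_bf16, naive_updates, and kahan_updates are hypothetical helper names.

```python
import struct


def to_bf16(x):
    """Round a Python float to the nearest bfloat16 value (returned as float)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000                   # keep sign, exponent, top 7 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def naive_updates(weight, update, steps):
    """Add `update` to a bf16 weight `steps` times with plain rounding."""
    for _ in range(steps):
        weight = to_bf16(weight + update)
    return weight


def kahan_updates(weight, update, steps):
    """Same updates, but carry the rounding error in a bf16 compensation buffer."""
    comp = 0.0
    for _ in range(steps):
        comp = to_bf16(comp + update)
        new_weight = to_bf16(weight + comp)
        # Whatever part of comp did not land in the weight stays in comp.
        comp = to_bf16(comp + to_bf16(weight - new_weight))
        weight = new_weight
    return weight


step = to_bf16(1e-3)
naive = naive_updates(1.0, step, 1000)
kahan = kahan_updates(1.0, step, 1000)
```

Near 1.0 the spacing between adjacent bfloat16 values is 2^-7, so each 0.001 update rounds away and the naive weight never moves, while the compensated weight ends close to the expected value of about 2.0.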

Sources: areal/engine/fsdp_utils/optimizer.py44-153 tests/test_per_layer_optim_step.py130-132 docs/en/best_practices/handling_oom.md167-181

Async Training Optimization

AReaL overlaps rollout generation (inference) with model updates (training). This is managed via the DistRolloutCoordinator and versioned weight updates.
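A toy asyncio model helps show what overlapping rollouts with versioned weight updates means in practice. This sketch does not use DistRolloutCoordinator or any AReaL API. The queue, the state dictionary, and the sleeps are assumptions standing in for the inference engine and the trainer.

```python
import asyncio


async def rollout_worker(queue, state, n_samples):
    for i in range(n_samples):
        version = state["version"]  # weights this sample is generated with
        await asyncio.sleep(0.001)  # stands in for inference latency
        await queue.put({"sample": i, "version": version})


async def trainer(queue, state, n_steps, batch_size, staleness_log):
    for _ in range(n_steps):
        batch = [await queue.get() for _ in range(batch_size)]
        # Staleness: how many weight versions behind each sample is.
        staleness_log.extend(state["version"] - item["version"] for item in batch)
        await asyncio.sleep(0.002)  # stands in for a training step
        state["version"] += 1       # publish new weights


async def main():
    queue = asyncio.Queue(maxsize=8)  # bounded buffer limits how far rollouts run ahead
    state = {"version": 0}
    staleness = []
    await asyncio.gather(
        rollout_worker(queue, state, 12),
        trainer(queue, state, 3, 4, staleness),
    )
    return state["version"], staleness


final_version, staleness = asyncio.run(main())
```

Generation keeps running while the trainer is busy, so some samples are consumed after the weights have moved on. Tagging each sample with its version is what makes that staleness measurable and controllable.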

System Throughput Milestones

SGLang Integration: Upgrading to SGLang improves throughput by leveraging radix attention blog/AReaL_v0_2.md13-15
Fully Asynchronous Pipeline: AReaL introduces a fully decoupled generation and training pipeline, achieving significant speedups over synchronous systems README.md102-104

Variable-Length Sequence Packing

To handle variable sequence lengths efficiently, AReaL eliminates padding by packing sequences into 1D tensors blog/AReaL_v0_2.md71-72. A dynamic allocation algorithm distributes these sequences under a token budget to maximize GPU utilization blog/AReaL_v0_2.md72-75. The utility pad_sequences_to_tensors provides a fallback, but 1D packing is preferred for performance areal/utils/data.py105-145.
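As a minimal illustration of why packing helps, the sketch below flattens variable-length sequences into one 1D list and records cumulative boundaries, then compares the token count with a padded layout. The function names are hypothetical, and plain lists stand in for tensors.

```python
from itertools import accumulate


def pack_sequences(seqs):
    """Concatenate sequences into one flat list plus cumulative boundaries."""
    flat = [tok for seq in seqs for tok in seq]
    cu_seqlens = [0] + list(accumulate(len(s) for s in seqs))
    return flat, cu_seqlens


def padded_size(seqs):
    """Number of slots a right-padded [batch, max_len] layout would occupy."""
    return len(seqs) * max(len(s) for s in seqs)


seqs = [[1, 2, 3, 4, 5, 6, 7, 8], [9, 10], [11, 12, 13]]
flat, cu_seqlens = pack_sequences(seqs)
packed_tokens = len(flat)
padded_tokens = padded_size(seqs)
```

Sequence i occupies flat[cu_seqlens[i]:cu_seqlens[i + 1]]. Here the packed layout holds 13 tokens where the padded one would hold 24, so almost half of the padded compute would be spent on padding.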

Context Parallelism for Large Scale

For extremely long sequences, AReaL supports packed context parallelism in backends like Megatron. This splits sequences into chunks across GPUs while maintaining load balancing for causal masking areal/engine/megatron_utils/packed_context_parallel.py11-19
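One common way to balance causal-attention work across context-parallel ranks is to cut each sequence into 2 * cp_size chunks and give every rank one chunk from the head and the mirrored chunk from the tail. The sketch below demonstrates that scheme under the assumption that it matches the idea described here. It is not taken from packed_context_parallel.py, and the function names are illustrative.

```python
def cp_chunk_indices(seq_len, cp_size, rank):
    """Token positions held by `rank` under head-tail chunk pairing."""
    n_chunks = 2 * cp_size
    assert seq_len % n_chunks == 0, "pad the sequence to a multiple of 2 * cp_size"
    chunk = seq_len // n_chunks
    head, tail = rank, n_chunks - 1 - rank
    return (list(range(head * chunk, (head + 1) * chunk))
            + list(range(tail * chunk, (tail + 1) * chunk)))


def causal_work(positions):
    """Attention cost proxy: under a causal mask, position p sees p + 1 keys."""
    return sum(p + 1 for p in positions)


seq_len, cp_size = 16, 2
balanced = [causal_work(cp_chunk_indices(seq_len, cp_size, r)) for r in range(cp_size)]

# Contiguous split for comparison: rank 0 gets the first half, rank 1 the second.
half = seq_len // cp_size
contiguous = [causal_work(range(r * half, (r + 1) * half)) for r in range(cp_size)]
```

With 16 tokens on 2 ranks, the paired split gives each rank 68 units of work, whereas a contiguous split gives 36 and 100, leaving the first rank idle for much of the step.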

Sources: blog/AReaL_v0_2.md13-75 areal/engine/megatron_utils/packed_context_parallel.py11-68 README.md102-104 areal/utils/data.py105-145

Batch Size and Micro-batching

Proper batch size configuration balances throughput and memory.

MicroBatchSpec Configuration

MicroBatchSpec controls how large batches are split for gradient accumulation areal/api/cli_args.py18:

Implementation Detail: The RLVRWorkflow builds result tensor dicts with a batch dimension of 1 areal/workflow/rlvr.py177, which are then aggregated and split according to the MicroBatchSpec by the trainer. For vision tasks, VisionRLVRWorkflow handles multi_modal_input similarly areal/workflow/vision_rlvr.py154-162. The utility get_batch_size is used throughout the system to validate these sizes areal/utils/data.py27-47.
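Splitting an aggregated batch under a per-micro-batch token cap is a bin-packing problem. The sketch below uses first-fit decreasing as one plausible strategy. The function name and the max_tokens_per_mb parameter are assumptions for illustration, and the trainer's real splitting logic may differ.

```python
def split_into_microbatches(seq_lens, max_tokens_per_mb):
    """Assign sequence indices to micro-batches without exceeding the token cap.

    First-fit decreasing: place the longest sequences first, each into the
    first micro-batch that still has room.
    """
    order = sorted(range(len(seq_lens)), key=lambda i: seq_lens[i], reverse=True)
    bins, loads = [], []
    for i in order:
        n = seq_lens[i]
        if n > max_tokens_per_mb:
            raise ValueError(f"sequence {i} has {n} tokens, above the cap")
        for b, load in enumerate(loads):
            if load + n <= max_tokens_per_mb:
                bins[b].append(i)
                loads[b] += n
                break
        else:
            # No existing micro-batch has room, so open a new one.
            bins.append([i])
            loads.append(n)
    return bins


seq_lens = [700, 300, 512, 200, 900, 100]
microbatches = split_into_microbatches(seq_lens, max_tokens_per_mb=1024)
token_loads = [sum(seq_lens[i] for i in mb) for mb in microbatches]
```

For these lengths the result is three micro-batches carrying 1000, 1000, and 712 tokens. A larger cap means fewer accumulation steps but higher peak activation memory, which is the throughput and memory trade-off this section describes.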

Sources: areal/workflow/rlvr.py169-177 areal/workflow/vision_rlvr.py154-162 areal/api/cli_args.py18 areal/utils/data.py27-47

Multi-Turn Memory and Efficiency

Multi-Turn Workflow Data Flow

In MultiTurnWorkflow, sequence length grows over turns. The workflow manages this by concatenating previous outputs and new prompts, using a pre-computed multi_turn_prompt_ids to avoid encode-decode inconsistencies areal/workflow/multi_turn.py41-57
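The idea of growing the sequence at the token level, instead of decoding to text and re-encoding each turn, can be sketched as follows. The helper names, the integer token IDs, and the loss mask are illustrative assumptions. Only the notion of a follow-up prompt tokenized once in advance comes from the description above.

```python
def start_episode(prompt_ids):
    """Initial sequence: prompt tokens, none of which are trained on."""
    return list(prompt_ids), [0] * len(prompt_ids)


def append_turn(input_ids, loss_mask, completion_ids, followup_prompt_ids=()):
    """Append the model's tokens (mask 1) and an optional follow-up prompt (mask 0)."""
    input_ids = input_ids + list(completion_ids) + list(followup_prompt_ids)
    loss_mask = (loss_mask
                 + [1] * len(completion_ids)
                 + [0] * len(followup_prompt_ids))
    return input_ids, loss_mask


# Tokenized once up front, so every turn reuses identical IDs.
FOLLOWUP_PROMPT_IDS = [900, 901]

ids, mask = start_episode([1, 2, 3])
ids, mask = append_turn(ids, mask, [10, 11], FOLLOWUP_PROMPT_IDS)  # turn 1, then retry prompt
ids, mask = append_turn(ids, mask, [20])                           # turn 2, final answer
```

Because the generated IDs are appended verbatim, the training sequence matches exactly what the inference engine produced, which a decode and re-encode round trip does not guarantee.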

Multi-Turn Rollout Execution

Sources: areal/workflow/multi_turn.py59-137

Performance Checklist

Item	Recommendation	Source
Backend	Use `SGLang` for 1.5x throughput gain in sampling scenarios.	blog/AReaL_v0_2.md60-64
Async RL	Decouple generation and training for maximum speedup.	README.md102-104
Data Transfer	Use NCCL with GPU-Direct RDMA (GDRDMA) for 1K-GPU scaling.	blog/AReaL_v0_2.md77-83
Packing	Use 1D tensor packing to eliminate padding overhead.	blog/AReaL_v0_2.md71-75
Offload	Use `per_layer_optim_step` to mitigate CPU Adam latency.	docs/en/best_practices/handling_oom.md167-181
Checkpointing	Use `AsyncCheckpointManager` (Archon) so that checkpoint saves do not block training.	areal/utils/saver.py175-180

Sources: blog/AReaL_v0_2.md13-83 README.md102-104 docs/en/best_practices/handling_oom.md167-181 areal/utils/saver.py172-181


URL: https://deepwiki.com/inclusionAI/AReaL/16.4-performance-optimization-guide