
URL: https://deepwiki.com/inclusionAI/AReaL/16.4-performance-optimization-guide



Last indexed: 7 May 2026 (2e12c1)

Performance Optimization Guide

This guide covers best practices for maximizing training throughput and minimizing latency in AReaL. It focuses on practical configuration choices, parallelism tuning, and monitoring techniques to achieve optimal performance.

For memory-related issues (OOM errors), see 16.2 Memory and OOM Issues. For debugging distributed training problems, see 16.3 Debugging Distributed Training. For specific algorithm configurations, see 2.8 Algorithm-Specific Configurations.


Performance Monitoring and Profiling

AReaL provides comprehensive performance tracing infrastructure to identify bottlenecks and measure optimization impact.

PerfTracer for Method-Level Timing

The PerfTracer class records timestamped trace events for individual methods and operations. The resulting traces can be viewed in Chrome Trace Format (Perfetto-compatible) tools. Events are grouped into functional categories to simplify analysis (areal/utils/perf_tracer.py:62-85).
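To make the idea of method-level trace events concrete, here is a minimal sketch of a tracer that records Chrome Trace Format "complete" events. It is not AReaL's PerfTracer: the class name `MiniTracer`, its methods, and the category names `compute` and `comm` are assumptions chosen for illustration.

```python
import json
import os
import threading
import time
from contextlib import contextmanager


class MiniTracer:
    """Hypothetical stand-in for a method-level tracer; not AReaL's PerfTracer."""

    def __init__(self, rank: int = 0):
        self.rank = rank
        self.events = []

    @contextmanager
    def trace(self, name: str, category: str):
        start = time.perf_counter_ns()
        try:
            yield
        finally:
            end = time.perf_counter_ns()
            # "X" is a Chrome Trace "complete" event; ts and dur are in microseconds.
            self.events.append({
                "name": name,
                "cat": category,
                "ph": "X",
                "ts": start / 1000,
                "dur": (end - start) / 1000,
                "pid": os.getpid(),
                "tid": threading.get_ident(),
                "args": {"rank": self.rank},
            })

    def dump_jsonl(self, path: str) -> None:
        # One JSON object per line, so ranks can append without coordination.
        with open(path, "w", encoding="utf-8") as f:
            for event in self.events:
                f.write(json.dumps(event) + "\n")


tracer = MiniTracer(rank=0)
with tracer.trace("train_step", category="compute"):
    sum(i * i for i in range(10_000))
with tracer.trace("all_reduce", category="comm"):
    time.sleep(0.001)
```

Recording the event in a `finally` clause means a span is still emitted when the traced code raises, which is usually the step you most want to see in the timeline.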

Enabling PerfTracer:

Configure it in your experiment YAML or on the CLI using PerfTracerConfig (areal/api/cli_args.py:25).


Key traced categories (PerfTraceCategory):

Viewing traces:

Convert JSONL traces to Chrome Trace Format with the perf_trace_converter tool, which remaps process and thread IDs so that they stay unique across distributed ranks (areal/tools/perf_trace_converter.py:121-132).
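The remapping step can be sketched as follows. This is an illustrative reconstruction of the general technique, not the converter's actual code: the function names `merge_rank_traces` and `load_jsonl` and the input layout (a dict from rank to event list) are assumptions.

```python
import json


def merge_rank_traces(per_rank_events):
    """Merge {rank: [event, ...]} into one Chrome Trace dict with unique pid/tid."""
    pid_map = {}
    tid_map = {}
    merged = []
    for rank in sorted(per_rank_events):
        for event in per_rank_events[rank]:
            # Raw OS ids can collide across hosts, so key them by rank as well.
            pid_key = (rank, event["pid"])
            tid_key = (rank, event["pid"], event["tid"])
            new_pid = pid_map.setdefault(pid_key, len(pid_map))
            new_tid = tid_map.setdefault(tid_key, len(tid_map))
            merged.append({**event, "pid": new_pid, "tid": new_tid})
    merged.sort(key=lambda e: e["ts"])
    return {"traceEvents": merged}


def load_jsonl(text):
    """Parse one JSON event per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Two ranks on different hosts that happen to report the same OS pid and tid.
rank0 = load_jsonl('{"name": "step", "ph": "X", "ts": 10, "dur": 5, "pid": 4242, "tid": 1}')
rank1 = load_jsonl('{"name": "step", "ph": "X", "ts": 12, "dur": 6, "pid": 4242, "tid": 1}')
trace = merge_rank_traces({0: rank0, 1: rank1})
chrome_json = json.dumps(trace)
```

Without the remapping, both events above would land on the same track in the viewer and appear to overlap within a single thread.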


Sources: areal/utils/perf_tracer.py:62-114, areal/tools/perf_trace_converter.py:121-215

SessionTracer for Rollout Lifecycle Tracking

SessionTracer tracks individual rollout sessions from submission through finalization, recording phase-level breakdowns such as "generate", "reward", and "toolcall" (areal/tools/plot_session_trace.py:24-30).
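As a rough illustration of phase-level tracking, the sketch below times named phases inside one rollout session with a context manager and aggregates them per phase. The `SessionRecord` class and its methods are hypothetical and do not reflect the SessionTracer API; only the phase names come from the text above.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class SessionRecord:
    """Hypothetical per-session record; not AReaL's SessionTracer."""
    session_id: str
    phases: list = field(default_factory=list)  # (phase, start_s, end_s)

    @contextmanager
    def phase(self, name: str):
        start = time.monotonic()
        try:
            yield
        finally:
            self.phases.append((name, start, time.monotonic()))

    def breakdown(self) -> dict:
        """Total seconds spent in each phase, summed over repeated visits."""
        totals = {}
        for name, start, end in self.phases:
            totals[name] = totals.get(name, 0.0) + (end - start)
        return totals


record = SessionRecord(session_id="rollout-0")
with record.phase("generate"):
    completion = "42"  # placeholder for an inference call
with record.phase("toolcall"):
    tool_output = int(completion) + 1
with record.phase("reward"):
    reward = 1.0 if tool_output == 43 else 0.0
with record.phase("generate"):
    completion = "done"
per_phase_seconds = record.breakdown()
```

Summing repeated phases matters for agentic rollouts, where "generate" and "toolcall" alternate many times within a single session.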

Enabling SessionTracer:

Configure it using SessionTracerConfig (areal/api/cli_args.py:25).


Tracked phases in Workflows:

Workflows like RLVRWorkflow and VisionRLVRWorkflow use decorators and context managers to mark these phases.

Visualizing session traces:


The plot_session_trace tool (areal/tools/plot_session_trace.py) generates an interactive Plotly-based HTML report.

Trace Data Flow Diagram

This diagram shows how performance data flows from the code entities to the visualization tools.

[Diagram: Performance Data Architecture]


Sources: areal/utils/perf_tracer.py:117-118, areal/workflow/rlvr.py:82-130, areal/tools/plot_session_trace.py:154-169, areal/tools/perf_trace_converter.py:14-27


FSDP Optimization Techniques

When using the FSDPEngine, several optimizations are available to reduce communication overhead and improve step time.

Per-Layer Optimizer Step

The PerLayerOptimWrapper streams optimizer states to the device one layer at a time. This is particularly effective when combined with CPU parameter offloading, because it avoids the latency of running the entire optimizer step on the CPU (areal/engine/fsdp_utils/optimizer.py:19-20).

Configuration:


Implementation:

The wrapper groups model parameters by layer (tests/test_per_layer_optim_step.py:130-132). During the step() call, it moves the required optimizer states (such as exp_avg and exp_avg_sq) to the GPU for one layer at a time, performs the update, and then moves them back to the CPU (areal/engine/fsdp_utils/optimizer.py:144-153). This minimizes peak GPU memory while maintaining high throughput (docs/en/best_practices/handling_oom.md:170-181).
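The streaming pattern can be illustrated without a GPU. The sketch below runs a plain Adam update (no weight decay) one layer at a time on Python lists and reports how many optimizer-state values are "resident" at once. It is a simplified model under stated assumptions, not the PerLayerOptimWrapper code: the function name, the dict-of-lists layout, and the use of list copies to stand in for host-to-device transfers are all illustrative.

```python
import math


def per_layer_adam_step(params, grads, state, step, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Update one layer at a time; return the peak count of state values held on the 'GPU'."""
    peak_resident = 0
    for layer in params:
        # "Move" this layer's optimizer state to the device (simulated by a copy).
        exp_avg = list(state[layer]["exp_avg"])
        exp_avg_sq = list(state[layer]["exp_avg_sq"])
        peak_resident = max(peak_resident, len(exp_avg) + len(exp_avg_sq))
        for i, g in enumerate(grads[layer]):
            exp_avg[i] = b1 * exp_avg[i] + (1 - b1) * g
            exp_avg_sq[i] = b2 * exp_avg_sq[i] + (1 - b2) * g * g
            m_hat = exp_avg[i] / (1 - b1 ** step)
            v_hat = exp_avg_sq[i] / (1 - b2 ** step)
            params[layer][i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
        # "Move" the updated state back to host memory before touching the next layer.
        state[layer]["exp_avg"] = exp_avg
        state[layer]["exp_avg_sq"] = exp_avg_sq
    return peak_resident


params = {"layer0": [1.0, 2.0], "layer1": [3.0, 4.0, 5.0]}
grads = {"layer0": [0.5, -0.5], "layer1": [0.1, 0.2, 0.3]}
state = {
    name: {"exp_avg": [0.0] * len(p), "exp_avg_sq": [0.0] * len(p)}
    for name, p in params.items()
}
total_state_values = sum(2 * len(p) for p in params.values())
peak = per_layer_adam_step(params, grads, state, step=1)
```

In this toy setup the peak is set by the largest layer rather than by the whole model, which is the property that makes the per-layer approach attractive when optimizer states live in host memory.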

AnyPrecision Optimization

For training in BFloat16 without sacrificing stability, the AnyPrecisionAdamW optimizer supports Kahan summation to track accumulated rounding errors in high precision (areal/engine/fsdp_utils/optimizer.py:44-52). It allows direct control over momentum_dtype and variance_dtype (areal/engine/fsdp_utils/optimizer.py:53-55).
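To see why compensated summation matters in BFloat16, consider a weight near 1.0 receiving many updates of 1e-4. The spacing between adjacent bfloat16 values near 1.0 is 2^-7, so each update on its own rounds away. The sketch below simulates bfloat16 rounding in pure Python and compares a naive update loop with a Kahan-style loop that carries the lost part in a compensation buffer. It illustrates the general technique only. The helper names are hypothetical, and keeping the compensation buffer in bfloat16 is an assumption of this sketch, not a statement about AReaL's implementation.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a Python float to the nearest bfloat16 value (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def naive_updates(weight: float, update: float, steps: int) -> float:
    """Apply the update directly; small updates are lost to rounding."""
    for _ in range(steps):
        weight = to_bf16(weight + update)
    return weight


def kahan_updates(weight: float, update: float, steps: int):
    """Apply the update through a compensation buffer that carries rounding error."""
    compensation = 0.0  # also kept in bfloat16 in this sketch (an assumption)
    for _ in range(steps):
        compensation = to_bf16(compensation + update)
        previous = weight
        weight = to_bf16(weight + compensation)
        # Whatever part of `compensation` the weight did not absorb is carried over.
        compensation = to_bf16(compensation + (previous - weight))
    return weight, compensation


naive = naive_updates(1.0, 1e-4, 1000)
weight, compensation = kahan_updates(1.0, 1e-4, 1000)
```

The naive loop never leaves 1.0, while the compensated loop ends close to the exact total of 1.1, up to the rounding of the compensation buffer itself.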

Sources: areal/engine/fsdp_utils/optimizer.py:44-153, tests/test_per_layer_optim_step.py:130-132, docs/en/best_practices/handling_oom.md:167-181


Async Training Optimization

AReaL overlaps rollout generation (inference) with model updates (training). This is managed via the DistRolloutCoordinator and versioned weight updates.
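The overlap of generation and training with versioned weights can be sketched with a producer thread and a bounded queue. This is a toy model, not the DistRolloutCoordinator: the function name, the queue size, and the bookkeeping of a single integer version are assumptions made for illustration.

```python
import queue
import threading


def run_async_pipeline(num_samples: int, batch_size: int):
    """Overlap rollout generation with training; return (consumed batches, final version)."""
    # A bounded buffer limits how far rollouts can run ahead of training.
    samples = queue.Queue(maxsize=2 * batch_size)
    version = {"value": 0}
    lock = threading.Lock()

    def rollout_worker():
        for sample_id in range(num_samples):
            with lock:
                generated_with = version["value"]  # weight version used for this rollout
            samples.put((sample_id, generated_with))

    producer = threading.Thread(target=rollout_worker)
    producer.start()

    batches = []
    for _ in range(num_samples // batch_size):
        batch = [samples.get() for _ in range(batch_size)]
        with lock:
            trained_at = version["value"]
            version["value"] += 1  # publish new weights after the update
        batches.append((trained_at, batch))
    producer.join()
    return batches, version["value"]


batches, final_version = run_async_pipeline(num_samples=32, batch_size=4)
# Staleness: how many weight versions behind each sample was when it was trained on.
staleness = [trained_at - v for trained_at, batch in batches for _, v in batch]
```

Tagging each sample with the version that produced it lets the trainer measure, and if desired bound, how off-policy its data is.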

System Throughput Milestones

  • SGLang Integration: Upgrading to SGLang improves throughput by leveraging radix attention (blog/AReaL_v0_2.md:13-15).
  • Fully Asynchronous Pipeline: AReaL introduces a fully decoupled generation and training pipeline, achieving significant speedups over synchronous systems (README.md:102-104).

Variable-Length Sequence Packing

To handle variable sequence lengths efficiently, AReaL eliminates padding by packing sequences into 1D tensors (blog/AReaL_v0_2.md:71-72). A dynamic allocation algorithm distributes these sequences under a token budget to maximize GPU utilization (blog/AReaL_v0_2.md:72-75). The utility pad_sequences_to_tensors provides a fallback, but 1D packing is preferred for performance (areal/utils/data.py:105-145).
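The difference between the two layouts is easy to quantify. The sketch below packs token-id lists into one flat list with cumulative sequence boundaries and compares the slot count against a padded 2D layout. The function names and the `cu_seqlens` naming are assumptions borrowed from common variable-length attention conventions, not AReaL's utilities.

```python
from itertools import accumulate


def pack_sequences(sequences):
    """Concatenate token-id lists into one flat list plus cumulative boundaries."""
    flat = [tok for seq in sequences for tok in seq]
    cu_seqlens = [0] + list(accumulate(len(seq) for seq in sequences))
    return flat, cu_seqlens


def pad_sequences(sequences, pad_id=0):
    """Padded 2D layout, shown only for comparison."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


sequences = [[11, 12, 13], [21], [31, 32, 33, 34, 35, 36]]
flat, cu_seqlens = pack_sequences(sequences)
padded = pad_sequences(sequences)

packed_tokens = len(flat)                        # 10 real tokens
padded_tokens = sum(len(row) for row in padded)  # 3 rows x 6 columns = 18 slots
# Sequence i is recovered as flat[cu_seqlens[i]:cu_seqlens[i + 1]].
second = flat[cu_seqlens[1]:cu_seqlens[2]]
```

Here padding would spend 8 of 18 slots on filler, and the gap widens as length variance grows, which is typical for reasoning-style rollouts.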

Context Parallelism for Large Scale

For extremely long sequences, AReaL supports packed context parallelism in backends like Megatron. This splits sequences into chunks across GPUs while maintaining load balancing for causal masking (areal/engine/megatron_utils/packed_context_parallel.py:11-19).
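One common way to balance causal attention across context-parallel ranks is to cut a sequence into 2 x cp_size chunks and give each rank one chunk from the head and the mirrored chunk from the tail. The sketch below assumes that scheme for a single sequence and checks the resulting workload. It is not taken from packed_context_parallel.py, and the function names are hypothetical.

```python
def split_for_context_parallel(seq_len: int, cp_size: int):
    """Return, per rank, the token positions it owns under a head+tail chunk pairing."""
    num_chunks = 2 * cp_size
    if seq_len % num_chunks != 0:
        raise ValueError("this sketch assumes seq_len is divisible by 2 * cp_size")
    chunk = seq_len // num_chunks
    assignment = []
    for rank in range(cp_size):
        head = range(rank * chunk, (rank + 1) * chunk)
        tail_index = num_chunks - 1 - rank
        tail = range(tail_index * chunk, (tail_index + 1) * chunk)
        assignment.append(list(head) + list(tail))
    return assignment


def causal_work(positions):
    """Count (query, key) pairs under a causal mask: token t attends to t + 1 keys."""
    return sum(t + 1 for t in positions)


assignment = split_for_context_parallel(seq_len=16, cp_size=4)
work_per_rank = [causal_work(p) for p in assignment]
# Baseline: contiguous quarters, where later ranks see far more keys.
contiguous = [causal_work(range(r * 4, (r + 1) * 4)) for r in range(4)]
```

With the pairing, every rank handles 34 query-key pairs. A contiguous split of the same sequence gives 10, 26, 42, and 58, so the last rank would dominate step time.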

Sources: blog/AReaL_v0_2.md:13-75, areal/engine/megatron_utils/packed_context_parallel.py:11-68, README.md:102-104, areal/utils/data.py:105-145


Batch Size and Micro-batching

Proper batch size configuration balances throughput and memory.

MicroBatchSpec Configuration

MicroBatchSpec controls how large batches are split for gradient accumulation (areal/api/cli_args.py:18).


Implementation Detail:

The RLVRWorkflow builds result tensor dicts with a batch dimension of 1 (areal/workflow/rlvr.py:177), which are then aggregated and split according to the MicroBatchSpec by the trainer. For vision tasks, VisionRLVRWorkflow handles multi_modal_input similarly (areal/workflow/vision_rlvr.py:154-162). The utility get_batch_size is used throughout the system to validate these sizes (areal/utils/data.py:27-47).
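A sketch of how a batch of variable-length sequences might be split into micro-batches under a token budget is shown below. The `MicroBatchPlan` dataclass and its fields `n_mbs` and `max_tokens_per_mb` are hypothetical stand-ins, and the greedy longest-first heuristic is an assumption; the real MicroBatchSpec fields and splitting algorithm may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MicroBatchPlan:
    """Hypothetical knobs; the real MicroBatchSpec fields may differ."""
    n_mbs: int = 1                           # minimum number of micro-batches
    max_tokens_per_mb: Optional[int] = None  # token budget per micro-batch


def split_into_micro_batches(seq_lens, plan):
    """Greedy longest-first assignment of sequence indices to micro-batches."""
    budget = plan.max_tokens_per_mb
    if budget is not None and max(seq_lens) > budget:
        raise ValueError("a single sequence exceeds max_tokens_per_mb")
    bins = [[] for _ in range(plan.n_mbs)]
    loads = [0] * plan.n_mbs
    for idx in sorted(range(len(seq_lens)), key=lambda i: -seq_lens[i]):
        # Put the sequence in the lightest bin; open a new bin if it would overflow.
        target = min(range(len(bins)), key=lambda b: loads[b])
        if budget is not None and loads[target] + seq_lens[idx] > budget:
            bins.append([])
            loads.append(0)
            target = len(bins) - 1
        bins[target].append(idx)
        loads[target] += seq_lens[idx]
    return bins, loads


seq_lens = [900, 120, 450, 300, 800, 60]
plan = MicroBatchPlan(n_mbs=2, max_tokens_per_mb=1024)
bins, loads = split_into_micro_batches(seq_lens, plan)
```

A lower token budget produces more, smaller micro-batches, which lowers peak activation memory at the cost of more forward and backward passes per optimizer step.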

Sources: areal/workflow/rlvr.py:169-177, areal/workflow/vision_rlvr.py:154-162, areal/api/cli_args.py:18, areal/utils/data.py:27-47


Multi-Turn Memory and Efficiency

Multi-Turn Workflow Data Flow

In MultiTurnWorkflow, sequence length grows over turns. The workflow manages this by concatenating previous outputs and new prompts, using a pre-computed multi_turn_prompt_ids to avoid encode-decode inconsistencies (areal/workflow/multi_turn.py:41-57).
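The token-level bookkeeping can be sketched as follows. Only the name multi_turn_prompt_ids comes from the text above. The function `run_multi_turn`, its `generate` and `is_correct` callbacks, and the loss-mask convention are assumptions for illustration and are not the MultiTurnWorkflow implementation.

```python
def run_multi_turn(prompt_ids, multi_turn_prompt_ids, generate, is_correct, max_turns=3):
    """Grow a token sequence over turns without decoding and re-encoding it."""
    input_ids = list(prompt_ids)
    loss_mask = [0] * len(prompt_ids)
    for turn in range(max_turns):
        completion = generate(input_ids)
        input_ids += completion
        loss_mask += [1] * len(completion)  # only generated tokens are trained on
        if is_correct(completion) or turn == max_turns - 1:
            break
        # Append the retry prompt as pre-computed ids; re-tokenizing the decoded
        # text could produce ids that differ from what the model generated.
        input_ids += multi_turn_prompt_ids
        loss_mask += [0] * len(multi_turn_prompt_ids)
    return input_ids, loss_mask


answers = iter([[7, 7], [8], [9, 9, 9]])  # canned completions for three turns
input_ids, loss_mask = run_multi_turn(
    prompt_ids=[1, 2, 3],
    multi_turn_prompt_ids=[100, 101],
    generate=lambda ids: next(answers),
    is_correct=lambda completion: completion == [9, 9, 9],
)
```

Because every turn re-reads the full prefix, the sequence length grows by the completion plus the retry prompt on each turn. Long multi-turn rollouts are therefore a common source of memory pressure.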

[Diagram: Multi-Turn Rollout Execution]


Sources: areal/workflow/multi_turn.py:59-137


Performance Checklist

| Item | Recommendation | Source |
|---|---|---|
| Backend | Use SGLang for 1.5x throughput gain in sampling scenarios. | blog/AReaL_v0_2.md:60-64 |
| Async RL | Decouple generation and training for maximum speedup. | README.md:102-104 |
| Data Transfer | Use NCCL with GPU-Direct RDMA (GDRDMA) for 1K-GPU scaling. | blog/AReaL_v0_2.md:77-83 |
| Packing | Use 1D tensor packing to eliminate padding overhead. | blog/AReaL_v0_2.md:71-75 |
| Offload | Use per_layer_optim_step to mitigate CPU Adam latency. | docs/en/best_practices/handling_oom.md:167-181 |
| Checkpointing | Use AsyncCheckpointManager (Archon) for zero-latency saves. | areal/utils/saver.py:175-180 |

Sources: blog/AReaL_v0_2.md:13-83, README.md:102-104, docs/en/best_practices/handling_oom.md:167-181, areal/utils/saver.py:172-181