
URL: https://deepwiki.com/inclusionAI/AReaL/3.5-microbatching-pipeline

Last indexed: 7 May 2026 (2e12c1)

Microbatching Pipeline

This page describes the microbatching system used by AReaL's training engines to split large training batches into smaller micro-batches for gradient accumulation. Microbatching enables training with effective batch sizes larger than what fits in GPU memory by processing multiple micro-batches sequentially and accumulating gradients before performing an optimizer step.

Purpose and Scope

The microbatching pipeline consists of:

  • Configuration via the MicroBatchSpec dataclass.
  • Splitting algorithms that partition batches based on token counts and sequence lengths using greedy (FFD) or near-optimal (KK) approaches.
  • Padding strategies to align micro-batches for efficient processing, including Ulysses sequence parallel alignment.
  • Data structures (MicroBatchList, MicroBatchItem) for managing micro-batch metadata.
  • Integration points where training engines (FSDP, Megatron, Archon) consume micro-batches.

This system is shared across all training backends and supports both standard sequence packing and tree training modes.


Configuration: MicroBatchSpec

The MicroBatchSpec dataclass defines how batches are split into micro-batches.

Field               Type         Description
n_mbs               int | None   Number of micro-batches to split the batch into.
granularity         int          Group adjacent sequences by this size when dividing.
max_tokens_per_mb   int | None   Maximum number of tokens per micro-batch; None selects static allocation.
n_mbs_divisor       int          Final micro-batch count must be divisible by this value.
packing_algorithm   str          Algorithm for allocation: ffd (First Fit Decreasing) or kk (Karmarkar-Karp).

Key behaviors:

  • Static allocation: When max_tokens_per_mb=None, the batch is split into exactly n_mbs micro-batches.
  • Dynamic allocation: When max_tokens_per_mb is set, the system uses the configured packing_algorithm to create balanced micro-batches respecting token limits.
  • Packing algorithms: ffd is a greedy heuristic, while kk (Largest Differencing Method) provides near-optimal balance for large-scale RL with variable sequence lengths.

Sources: areal/api/cli_args.py:99-147, areal/utils/seqpack.py:161-187
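The fields and the static/dynamic distinction above can be sketched as a small stand-in dataclass. This is an illustration only: MicroBatchSpecSketch, its default values, and the mode property are assumptions made for the example and are not taken from areal/api/cli_args.py.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MicroBatchSpecSketch:
    """Hypothetical stand-in mirroring the documented fields.

    All default values here are assumptions for illustration.
    """

    n_mbs: Optional[int] = 1
    granularity: int = 1
    max_tokens_per_mb: Optional[int] = None
    n_mbs_divisor: int = 1
    packing_algorithm: str = "ffd"

    def __post_init__(self) -> None:
        # Only the two algorithms named in the table are accepted.
        if self.packing_algorithm not in ("ffd", "kk"):
            raise ValueError(f"unknown packing_algorithm: {self.packing_algorithm!r}")

    @property
    def mode(self) -> str:
        # Static when no token limit is set, dynamic otherwise.
        return "static" if self.max_tokens_per_mb is None else "dynamic"


static_spec = MicroBatchSpecSketch(n_mbs=4)
dynamic_spec = MicroBatchSpecSketch(max_tokens_per_mb=4096, packing_algorithm="kk")
print(static_spec.mode, dynamic_spec.mode)
```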


Data Structures

MicroBatchList

MicroBatchList is the primary container returned by the splitting pipeline. It encapsulates the split micro-batches along with metadata needed for processing and reconstruction.


Diagram: MicroBatchList and MicroBatchItem structure

Key attributes:

  • forward_indices: Mapping from original batch order to micro-batch order.
  • backward_indices: Inverse mapping to reconstruct original order.
  • padded_mbs: List of padded micro-batch dictionaries ready for model forward.
  • old_cu_seqlens_list: Original cumulative sequence lengths before alignment (used for context parallel unpadding).

Sources: areal/utils/data.py:385-471
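The relationship between forward_indices and backward_indices can be illustrated with plain lists. The sketch assumes that forward_indices[k] holds the original position of the item placed at position k in micro-batch order; the helper invert_permutation is hypothetical and not part of AReaL.

```python
def invert_permutation(forward_indices):
    """Return the inverse permutation of forward_indices."""
    backward = [0] * len(forward_indices)
    for new_pos, orig_pos in enumerate(forward_indices):
        backward[orig_pos] = new_pos
    return backward


batch = ["s0", "s1", "s2", "s3"]
forward_indices = [2, 0, 3, 1]  # assumed micro-batch order: s2, s0, s3, s1

# Reorder into micro-batch order, then undo the reordering.
reordered = [batch[i] for i in forward_indices]
backward_indices = invert_permutation(forward_indices)
restored = [reordered[i] for i in backward_indices]

print(reordered, backward_indices, restored)
```

Applying backward_indices to per-sequence outputs collected in micro-batch order returns them to the order of the input batch.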

MicroBatchItem

MicroBatchItem is a NamedTuple yielded when iterating over a MicroBatchList.

Field            Type     Purpose
orig_mb          dict     Original micro-batch for loss weight computation.
padded_mb        dict     Padded micro-batch for model forward pass.
padding_length   int      Batch-level padding added.
old_cu_seqlens   Tensor   Pre-alignment cumulative sequence lengths.

Sources: areal/utils/data.py:367-383
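A minimal sketch of a tuple with the four fields above, together with how cumulative sequence lengths are derived from per-sequence lengths. MicroBatchItemSketch and cu_seqlens are hypothetical names; the real old_cu_seqlens is a tensor, replaced here by a plain list, and the dictionary keys are invented for the example.

```python
from itertools import accumulate
from typing import Any, Dict, List, NamedTuple


class MicroBatchItemSketch(NamedTuple):
    """Hypothetical stand-in for the documented fields."""

    orig_mb: Dict[str, Any]
    padded_mb: Dict[str, Any]
    padding_length: int
    old_cu_seqlens: List[int]  # list stand-in for a tensor


def cu_seqlens(lengths: List[int]) -> List[int]:
    """Cumulative sequence lengths: offsets of each sequence in a packed buffer."""
    return [0, *accumulate(lengths)]


lengths = [3, 5, 2]
item = MicroBatchItemSketch(
    orig_mb={"total_tokens": sum(lengths)},   # "total_tokens" is an invented key
    padded_mb={"total_tokens": 16},           # packed buffer padded to 16 tokens
    padding_length=16 - sum(lengths),
    old_cu_seqlens=cu_seqlens(lengths),
)

# NamedTuples unpack positionally, which is convenient inside a training loop.
orig_mb, padded_mb, padding_length, old_cu = item
print(padding_length, old_cu)
```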


Batch Splitting Pipeline

The microbatching pipeline transforms a padded batch dictionary into a MicroBatchList through several stages.


Diagram: Microbatching pipeline flow from input batch to MicroBatchList

Stage 1: Sequence Length Calculation

The pipeline extracts sequence lengths from the input batch's attention_mask. When granularity > 1, sequences are grouped before calculating total lengths.
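As a rough illustration of this stage, the sketch below derives per-sequence lengths from a 0/1 attention mask and then sums adjacent sequences in groups of granularity. The function names are hypothetical, nested lists stand in for tensors, and the sketch assumes the batch size is divisible by granularity.

```python
def sequence_lengths(attention_mask):
    """Count the non-padding tokens in each row of a 0/1 mask."""
    return [sum(row) for row in attention_mask]


def group_lengths(lengths, granularity):
    """Sum the lengths of each run of `granularity` adjacent sequences."""
    if len(lengths) % granularity != 0:
        raise ValueError("batch size must be divisible by granularity (assumption)")
    return [
        sum(lengths[i : i + granularity])
        for i in range(0, len(lengths), granularity)
    ]


mask = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [1, 0, 0, 0],
]
lengths = sequence_lengths(mask)
grouped = group_lengths(lengths, granularity=2)
print(lengths, grouped)
```

With grouping, each group is allocated to a micro-batch as a unit, so its combined token count is what the allocator sees.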

Stage 2: Micro-batch Allocation

The system dispatches to an allocation function via get_allocate_fn(algorithm):

  • FFD (ffd_allocate): Implements First-Fit Decreasing.
  • KK (kk_allocate): Implements Karmarkar-Karp partitioning, which typically yields a tighter load balance than FFD.
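The two strategies can be sketched in plain Python. These are textbook versions written for illustration, not the code in areal/utils/seqpack.py: ffd_allocate_sketch and kk_partition_sketch are hypothetical names, and their signatures (a token capacity for FFD, a fixed number of parts k for KK) are assumptions.

```python
import heapq
from itertools import count


def ffd_allocate_sketch(lengths, capacity):
    """First-Fit Decreasing: place each item, longest first, in the first bin with room."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins, loads = [], []
    for i in order:
        if lengths[i] > capacity:
            raise ValueError(f"sequence {i} exceeds capacity {capacity}")
        for b in range(len(bins)):
            if loads[b] + lengths[i] <= capacity:
                bins[b].append(i)
                loads[b] += lengths[i]
                break
        else:  # no existing bin had room
            bins.append([i])
            loads.append(lengths[i])
    return bins


def kk_partition_sketch(lengths, k):
    """Karmarkar-Karp (largest differencing) partition of items into k parts."""
    if k < 1:
        raise ValueError("k must be >= 1")
    if not lengths:
        return [[] for _ in range(k)]
    tie = count()  # tie-breaker so the heap never compares bin lists
    heap = []
    for idx, n in enumerate(lengths):
        # Each item starts as a k-way partition with one non-empty part.
        parts = [(n, [idx])] + [(0, []) for _ in range(k - 1)]
        spread = parts[0][0] - parts[-1][0]
        heapq.heappush(heap, (-spread, next(tie), parts))
    while len(heap) > 1:
        _, _, a = heapq.heappop(heap)
        _, _, b = heapq.heappop(heap)
        # Pair the largest part of one with the smallest part of the other.
        merged = [(sa + sb, ia + ib) for (sa, ia), (sb, ib) in zip(a, reversed(b))]
        merged.sort(key=lambda part: part[0], reverse=True)
        spread = merged[0][0] - merged[-1][0]
        heapq.heappush(heap, (-spread, next(tie), merged))
    return [indices for _, indices in heap[0][2]]


ffd_bins = ffd_allocate_sketch([5, 3, 4, 2, 6], capacity=8)
kk_parts = kk_partition_sketch([8, 7, 6, 5, 4], k=2)
print(ffd_bins, kk_parts)
```

FFD decides the number of bins from the capacity, while the KK sketch balances a fixed number of parts; how AReaL reconciles token limits with the KK path is not shown here.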

Stage 3: Distributed Synchronization

In distributed training, all ranks must agree on the number of micro-batches to ensure consistent pipeline scheduling. The allocate_balanced_mbs_synced function performs an all_gather_object to find the maximum n_mbs across all ranks.
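The agreement step can be simulated without a process group. In the sketch below, per_rank_n_mbs stands in for the list that torch.distributed.all_gather_object would fill on every rank; agree_on_n_mbs is a hypothetical helper, and rounding up to n_mbs_divisor at this point is an assumption about where that constraint is applied.

```python
def agree_on_n_mbs(per_rank_n_mbs, n_mbs_divisor=1):
    """Pick one micro-batch count that every rank can use.

    per_rank_n_mbs simulates the gathered per-rank counts; in a real job
    each rank would obtain this list via torch.distributed.all_gather_object.
    """
    n_mbs = max(per_rank_n_mbs)
    # Round up to the next multiple of the divisor (ceiling division).
    return -(-n_mbs // n_mbs_divisor) * n_mbs_divisor


# Three ranks computed different local micro-batch counts.
print(agree_on_n_mbs([3, 5, 4]))
print(agree_on_n_mbs([3, 5, 4], n_mbs_divisor=2))
```

Ranks whose local allocation produced fewer micro-batches then need to re-split their data into the agreed count, so that every rank takes the same number of steps.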

Sources: areal/utils/seqpack.py:167-279, areal/utils/data.py:244-271, areal/utils/data.py:477-593


Padding Strategy

After splitting, micro-batches are padded to enable efficient tensor operations via pad_mb_list().

Context Parallel Alignment

When context parallelism (Ulysses) is enabled, each sequence length must be divisible by sp_size (the sequence parallel size). The ulysses_pad function pads each sequence up to the nearest multiple of align_to.
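The alignment arithmetic is a round-up to the next multiple. A minimal sketch, using a hypothetical helper rather than the actual ulysses_pad function:

```python
def pad_lengths_to_multiple(lengths, align_to):
    """Round each length up to the nearest multiple of align_to."""
    return [-(-n // align_to) * align_to for n in lengths]


lengths = [5, 8, 13]
aligned = pad_lengths_to_multiple(lengths, align_to=4)
padding = [a - n for a, n in zip(aligned, lengths)]
print(aligned, padding)
```

After alignment every sequence splits evenly across the sequence parallel ranks; the original lengths are what old_cu_seqlens preserves for unpadding.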

Memory Optimization

The pad_to_maximum flag controls whether each micro-batch is padded independently or to a global maximum length. Padding to the maximum reduces memory fragmentation but may increase total computation.
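The trade-off can be made concrete by comparing total padded tokens under the two policies. This sketch assumes that independent padding rounds each micro-batch up to a multiple of a bucket size (align_to, a hypothetical parameter) and that the global maximum is the largest rounded size; AReaL's actual padding targets may differ.

```python
def padded_sizes(mb_token_counts, pad_to_maximum, align_to=256):
    """Return the padded token count of each micro-batch under one policy."""
    rounded = [-(-n // align_to) * align_to for n in mb_token_counts]
    if pad_to_maximum:
        # Every micro-batch gets the same shape.
        return [max(rounded)] * len(rounded)
    return rounded


counts = [900, 1000, 640]
independent = padded_sizes(counts, pad_to_maximum=False)
uniform = padded_sizes(counts, pad_to_maximum=True)
print(independent, sum(independent))
print(uniform, sum(uniform))
```

Uniform shapes let the allocator reuse same-sized buffers across micro-batches, at the cost of processing more padding tokens overall.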

Sources: areal/utils/data.py:620-718


Integration with Training Engines

All training engines follow a common pattern for consuming micro-batches, either via manual loops or pipeline schedulers.

Engine Comparison

Engine           Execution Strategy                   Key Function
FSDPEngine       Sequential loop over micro-batches   forward_backward_batch
MegatronEngine   Pipeline-parallel schedule           train_batch / forward_backward_func
ArchonEngine     Sequential or Pipeline-parallel      ForwardBackwardRunner.run

ArchonEngine Pipeline Flow

The ArchonEngine uses a ForwardBackwardRunner abstraction to handle micro-batches.


Diagram: ArchonEngine micro-batch execution logic

Key implementation details:

  • FSDPEngine: Uses forward_backward_batch to loop over MicroBatchItem entries and call the model.
  • MegatronEngine: Integrates with Megatron-Core's pipeline scheduler, providing the MicroBatchList as a data iterator.
  • Loss normalization: Engines use compute_total_loss_weight to aggregate loss weights across the DP group for global normalization (in FSDP, Megatron, and Archon).

Sources: areal/engine/fsdp_engine.py:540-615, areal/engine/megatron_engine.py:637-706, areal/experimental/engine/archon_engine.py:147-210, areal/engine/core/__init__.py:62
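Why a global loss weight matters can be shown with scalar arithmetic. The sketch assumes the loss weight of a micro-batch is its token count and uses a single process; in a distributed run the total weight would additionally be summed across the DP group. None of the names below come from AReaL.

```python
import math


def mean(values):
    return sum(values) / len(values)


# Per-token losses for one batch, split into unevenly sized micro-batches.
token_losses = [0.5, 1.5, 2.0, 1.0, 3.0, 0.25]
micro_batches = [token_losses[:2], token_losses[2:5], token_losses[5:]]

# Total weight over all micro-batches (assumed: number of tokens).
total_weight = sum(len(mb) for mb in micro_batches)

# Accumulate each micro-batch loss scaled by its share of the total weight.
accumulated = 0.0
for mb in micro_batches:
    accumulated += mean(mb) * (len(mb) / total_weight)

full_batch = mean(token_losses)
naive = mean([mean(mb) for mb in micro_batches])  # ignores micro-batch size

print(accumulated, full_batch, naive)
```

The weighted accumulation matches the full-batch mean, whereas averaging per-micro-batch means over-weights tokens in small micro-batches. Because gradients are linear in the loss, the same scaling applies to accumulated gradients.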


Tree Training Support

The microbatching pipeline supports tree training, where multiple trajectories share common prefixes.

When enable_tree_training=True:

  1. Tree Construction: build_packed_tree_batch organizes trajectories into a trie structure.
  2. Context Management: The FSDPTrainContext and ArchonTrainContext carry the trie_node through the pipeline.
  3. Logprob Gathering: Specialized functions like _gather_packed_tree_logprobs and gather_packed_tree_logprobs_entropy are used to extract results from the tree structure.

Sources: areal/engine/fsdp_engine.py:141-168, areal/experimental/engine/archon_engine.py:124-145, areal/models/tree_attn/tree.py:106, areal/models/tree_attn/functional.py:97-99
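The benefit of sharing prefixes can be sketched with a dictionary-based trie over token IDs. This is a generic illustration and not build_packed_tree_batch: build_trie and count_nodes are hypothetical helpers, and the trie stores only structure, not the packed tensors a training engine would need.

```python
def build_trie(trajectories):
    """Insert each token sequence into a nested-dict trie keyed by token ID."""
    root = {}
    for tokens in trajectories:
        node = root
        for token in tokens:
            node = node.setdefault(token, {})
    return root


def count_nodes(node):
    """Count trie nodes, i.e. the unique (prefix, token) positions."""
    return sum(1 + count_nodes(child) for child in node.values())


trajectories = [
    [1, 2, 3, 4],
    [1, 2, 3, 5],
    [1, 2, 6],
]
trie = build_trie(trajectories)
total_tokens = sum(len(t) for t in trajectories)
unique_tokens = count_nodes(trie)
print(total_tokens, unique_tokens)
```

Each shared prefix token is stored once in the trie, so the number of nodes is what a prefix-sharing forward pass would process instead of the full token count.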