Last indexed: 7 May 2026 (2e12c1)

MicroBatchSpec and Data Configurations

This page documents the micro-batching configuration system and data processing pipeline in AReaL. Micro-batching is a critical component for efficient distributed training, allowing large batches to be processed in smaller chunks to manage GPU memory and enable gradient accumulation.

Scope: This page covers MicroBatchSpec configuration for controlling micro-batch splitting, the data structures used during training (MicroBatchList, MicroBatchItem), and the data processing pipeline from raw sequences to packed tensors. For dataset loading configurations, see section 10.5, Datasets and Reward Functions.

MicroBatchSpec Overview

The MicroBatchSpec dataclass controls how batches are divided into micro-batches during training. This configuration is specified in TrainEngineConfig.mb_spec and applies to both forward and backward passes.

MicroBatch Configuration Flow

Configuration parameters from MicroBatchSpec drive the splitting logic in areal/utils/data.py.

Sources: areal/api/cli_args.py99-139 areal/utils/data.py477-593 areal/utils/data.py244-270

MicroBatchSpec Parameters

Parameter	Type	Default	Description
`n_mbs`	`int | None`	`1`	Number of micro-batches (or minimum number if `max_tokens_per_mb` is set). areal/api/cli_args.py102-108
`granularity`	`int`	`1`	Adjacent sequences are grouped by this size when dividing micro-batches. Useful for group-based algorithms like GRPO. areal/api/cli_args.py109-114
`max_tokens_per_mb`	`int | None`	`None`	Maximum tokens per micro-batch for each forward pass. When set, `n_mbs` becomes the minimum count. areal/api/cli_args.py115-120
`n_mbs_divisor`	`int`	`1`	Divisor for the number of micro-batches. The final count is adjusted to be divisible by this value. areal/api/cli_args.py121-126
`packing_algorithm`	`str`	`"ffd"`	Sequence packing algorithm for allocation. Supported: `"ffd"` (First Fit Decreasing), `"kk"` (Karmarkar-Karp). areal/api/cli_args.py127-140
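For orientation, the table above can be mirrored by a small stand-in dataclass. This is a sketch, not the definition in areal/api/cli_args.py: the class name `MicroBatchSpecSketch` and the `__post_init__` checks are assumptions added for illustration, and the value 8192 is arbitrary. Only the field names, types, and defaults come from the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MicroBatchSpecSketch:
    """Hypothetical stand-in mirroring the documented fields and defaults."""

    n_mbs: Optional[int] = 1
    granularity: int = 1
    max_tokens_per_mb: Optional[int] = None
    n_mbs_divisor: int = 1
    packing_algorithm: str = "ffd"

    def __post_init__(self) -> None:
        # Illustrative checks only; the real class may validate differently.
        if self.packing_algorithm not in ("ffd", "kk"):
            raise ValueError(f"unknown packing_algorithm: {self.packing_algorithm!r}")
        if self.granularity < 1 or self.n_mbs_divisor < 1:
            raise ValueError("granularity and n_mbs_divisor must be >= 1")


# Override only what differs from the defaults (8192 is an arbitrary budget).
spec = MicroBatchSpecSketch(granularity=4, max_tokens_per_mb=8192, packing_algorithm="kk")
```

Fields left unset keep the defaults listed in the table, so `spec.n_mbs` is still `1` here and acts as the minimum count because `max_tokens_per_mb` is set.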

Parameter Interactions

The micro-batch count and assignment are determined by:

If max_tokens_per_mb is None: Uses exactly n_mbs micro-batches areal/api/cli_args.py102-108
If max_tokens_per_mb is set: Uses the selected packing_algorithm to allocate sequences, with n_mbs as the minimum count areal/api/cli_args.py115-120
Granularity constraint: Batch size must be divisible by granularity. This is validated in split_padded_tensor_dict_into_mb_list areal/utils/data.py500-502
Divisor constraint: Final micro-batch count is adjusted to be divisible by n_mbs_divisor areal/api/cli_args.py121-126

Distributed synchronization: When running distributed training, allocate_balanced_mbs_synced() ensures all data parallel ranks agree on the number of micro-batches by taking the maximum across ranks areal/utils/data.py256-270
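The interactions above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the logic in areal/utils/data.py: the helper names `resolve_n_mbs` and `sync_across_ranks` are hypothetical, rounding up to the next multiple of `n_mbs_divisor` is an assumption (the source only says the count is "adjusted"), the order of the divisor and synchronization steps is assumed, and the cross-rank maximum is simulated with a plain list instead of a collective operation.

```python
import math


def resolve_n_mbs(local_count: int, n_mbs: int, n_mbs_divisor: int) -> int:
    """Apply the minimum-count and divisor constraints to one rank's count."""
    count = max(local_count, n_mbs)  # n_mbs acts as a lower bound
    # Assumption: round up to the next multiple of the divisor.
    return math.ceil(count / n_mbs_divisor) * n_mbs_divisor


def sync_across_ranks(per_rank_counts: list[int]) -> int:
    """Stand-in for the gather-and-max step: every rank adopts the largest count."""
    return max(per_rank_counts)


# Three simulated data parallel ranks whose packing produced 3, 1, and 5 micro-batches.
per_rank = [resolve_n_mbs(c, n_mbs=2, n_mbs_divisor=2) for c in (3, 1, 5)]
agreed = sync_across_ranks(per_rank)
```

In this run the per-rank counts become 4, 2, and 6, and all ranks settle on 6 micro-batches.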

Sources: areal/api/cli_args.py99-139 areal/utils/data.py244-270

Sequence Packing Algorithms

AReaL supports configurable algorithms for micro-batch allocation. The selection is handled by get_allocate_fn(algorithm) areal/utils/seqpack.py167-188

Algorithm	Key	Description	Balance Quality
First Fit Decreasing	`ffd`	Greedy bin-packing. Sorts sequences by length in decreasing order and assigns each one to the first bin with enough remaining capacity. areal/utils/seqpack.py196-203	Good
Karmarkar-Karp	`kk`	Largest Differencing Method. Iteratively merges imbalanced partial partitions using a max-heap. areal/utils/seqpack.py214-221	Excellent
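To make the First Fit Decreasing row concrete, here is a textbook sketch of the heuristic over sequence lengths. It is not the code in areal/utils/seqpack.py: the function name `ffd_allocate`, its signature, and the choice to raise on an oversized sequence are assumptions made for illustration.

```python
def ffd_allocate(lengths: list[int], capacity: int) -> list[list[int]]:
    """Return bins of sequence indices; each bin's total length is <= capacity."""
    bins: list[list[int]] = []
    loads: list[int] = []
    # Visit sequences from longest to shortest.
    for idx in sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True):
        if lengths[idx] > capacity:
            raise ValueError(f"sequence {idx} exceeds capacity {capacity}")
        for b, load in enumerate(loads):
            if load + lengths[idx] <= capacity:  # first bin that still fits
                bins[b].append(idx)
                loads[b] += lengths[idx]
                break
        else:
            bins.append([idx])  # no existing bin fits: open a new one
            loads.append(lengths[idx])
    return bins


bins = ffd_allocate([7, 2, 5, 4, 3], capacity=10)
```

With a capacity of 10 tokens, the five sequences land in three bins with loads 10, 9, and 2, which shows why FFD respects the token budget but can leave the last bin lightly filled.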

When to use KK

The Karmarkar-Karp (kk) algorithm is recommended for large-scale RL training with highly variable sequence lengths (e.g., PPO with open-ended generation) or a high data-parallel degree (≥4 ranks), where even small imbalances cause significant idle time at synchronization barriers areal/api/cli_args.py132-135
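For readers unfamiliar with the Largest Differencing Method, the following is a generic k-way sketch. It is a textbook-style illustration, not the AReaL implementation: the name `kk_partition` and the fixed part count `k` are assumptions, and how the real code combines this heuristic with `max_tokens_per_mb` is not shown here.

```python
import heapq


def kk_partition(lengths: list[int], k: int) -> list[list[int]]:
    """Split sequence indices into k parts with similar total length (LDM heuristic)."""
    # Each heap entry is a partial partition: k (sum, indices) parts sorted by
    # descending sum, keyed by the negated spread (largest sum - smallest sum).
    heap = []
    for idx, length in enumerate(lengths):
        parts = [(length, [idx])] + [(0, []) for _ in range(k - 1)]
        heap.append((-length, idx, parts))  # idx breaks ties deterministically
    if not heap:
        return [[] for _ in range(k)]
    heapq.heapify(heap)
    while len(heap) > 1:
        _, tie, a = heapq.heappop(heap)
        _, _, b = heapq.heappop(heap)
        # Pair a's largest part with b's smallest, and so on, to cancel imbalance.
        merged = [(sa + sb, ia + ib) for (sa, ia), (sb, ib) in zip(a, reversed(b))]
        merged.sort(key=lambda part: part[0], reverse=True)
        spread = merged[0][0] - merged[-1][0]
        heapq.heappush(heap, (-spread, tie, merged))
    return [indices for _, indices in heap[0][2]]


lengths = [8, 7, 6, 5, 4]
parts = kk_partition(lengths, k=2)
sums = [sum(lengths[i] for i in part) for part in parts]
```

For these five lengths the two parts total 16 and 14 tokens. The heuristic works on a fixed number of parts, which is why balance, rather than a per-bin capacity, is its primary objective.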

Sources: areal/utils/seqpack.py161-188 areal/api/cli_args.py127-140

Data Processing Pipeline

The data processing pipeline transforms raw sequences into micro-batched tensors ready for model consumption.

Sources: areal/utils/data.py105-145 areal/utils/data.py273-322 areal/utils/data.py477-593 areal/utils/data.py693-847

Sequence Packing

Purpose: Packing converts padded 2D tensors [B, S] into 1D packed tensors [total_len] by removing padding, improving memory efficiency for variable-length sequences areal/utils/data.py273-280

Key function: pack_tensor_dict(data) at areal/utils/data.py273-322

Output: Dictionary with packed tensors and:

cu_seqlens: Cumulative sequence lengths [B+1] (e.g., [0, 10, 25, 40] for 3 sequences of lengths 10, 15, 15) areal/utils/data.py307-310
max_seqlen: Maximum sequence length in the batch areal/utils/data.py311-312
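The packing transformation can be illustrated without tensors. The sketch below uses nested Python lists in place of `[B, S]` tensors; the function name `pack_padded` and its return shape are assumptions for illustration and do not reproduce `pack_tensor_dict`, which operates on a dictionary of tensors.

```python
from itertools import accumulate


def pack_padded(input_ids: list[list[int]], attention_mask: list[list[int]]):
    """Drop padded positions and return (packed, cu_seqlens, max_seqlen)."""
    seqlens = [sum(row) for row in attention_mask]
    packed = [
        tok
        for ids, mask in zip(input_ids, attention_mask)
        for tok, keep in zip(ids, mask)
        if keep
    ]
    cu_seqlens = [0, *accumulate(seqlens)]  # B + 1 entries
    return packed, cu_seqlens, max(seqlens, default=0)


input_ids = [[11, 12, 13, 0], [21, 22, 0, 0], [31, 32, 33, 34]]
attention_mask = [[1, 1, 1, 0], [1, 1, 0, 0], [1, 1, 1, 1]]
packed, cu_seqlens, max_seqlen = pack_padded(input_ids, attention_mask)
```

Here 12 padded slots shrink to 9 real tokens, `cu_seqlens` is `[0, 3, 5, 9]`, and `max_seqlen` is 4.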

Sequence Unpacking

Key function: unpack_sequence(x, cu_seqlens) at areal/utils/data.py228-241. It splits a packed tensor back into variable-length sequences using cu_seqlens, and is used during loss computation and output processing.
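The inverse operation amounts to slicing between consecutive `cu_seqlens` boundaries. The sketch below works on a flat Python list rather than a tensor, and the name `unpack_sequence_sketch` is hypothetical; it only illustrates the indexing, not the actual function.

```python
def unpack_sequence_sketch(x: list[int], cu_seqlens: list[int]) -> list[list[int]]:
    """Slice a packed sequence at consecutive cumulative-length boundaries."""
    return [x[start:end] for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:])]


packed = list(range(40))
seqs = unpack_sequence_sketch(packed, [0, 10, 25, 40])
seq_lengths = [len(s) for s in seqs]
```

Using the boundaries `[0, 10, 25, 40]` recovers three sequences of lengths 10, 15, and 15.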

MicroBatchList Structure

MicroBatchList is the central data structure that flows through the training pipeline. It encapsulates micro-batches and their metadata.

Sources: areal/utils/data.py385-471

Core Attributes

Attribute	Type	Description
`data`	`dict[str, Any]`	Original input data (before splitting) areal/utils/data.py387-388
`mb_spec`	`MicroBatchSpec`	Configuration used to create this list areal/utils/data.py389-390
`mbs`	`list[dict[str, Any]]`	List of original (unpadded) micro-batch dictionaries areal/utils/data.py391-392
`forward_indices`	`list[int]`	Sequence reordering from original to micro-batch order areal/utils/data.py395-396
`backward_indices`	`list[int]`	Reverse mapping (micro-batch order to original order) areal/utils/data.py397-398
`padded_mbs`	`list[dict] | None`	Padded micro-batches ready for model forward (set by `pad_mb_list()`) areal/utils/data.py400-401
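The relationship between `forward_indices` and `backward_indices` is easiest to see as a permutation and its inverse. The sketch below assumes that convention (position `j` in micro-batch order holds original sequence `forward_indices[j]`); the helper name `invert_permutation` is hypothetical.

```python
def invert_permutation(forward_indices: list[int]) -> list[int]:
    """Build the mapping that restores original order after reordering."""
    backward = [0] * len(forward_indices)
    for new_pos, orig_pos in enumerate(forward_indices):
        backward[orig_pos] = new_pos
    return backward


original = ["a", "b", "c", "d"]
forward_indices = [2, 0, 3, 1]  # micro-batch order picks originals 2, 0, 3, 1
reordered = [original[i] for i in forward_indices]
backward_indices = invert_permutation(forward_indices)
restored = [reordered[i] for i in backward_indices]
```

Indexing the reordered data with `backward_indices` returns it to the original order, which is how per-sequence outputs can be matched back to their inputs.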

MicroBatchItem

When iterating over MicroBatchList, each iteration yields a MicroBatchItem named tuple:

Field	Type	Purpose
`orig_mb`	`dict[str, Any]`	Original micro-batch (for loss weight computation) areal/utils/data.py369-370
`padded_mb`	`dict[str, Any]`	Padded micro-batch (for model forward pass) areal/utils/data.py371-372
`padding_length`	`int`	Batch-level padding added (for output unpadding) areal/utils/data.py373-374
`old_cu_seqlens`	`Tensor | None`	Original `cu_seqlens` before sequence alignment areal/utils/data.py375-378
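As a rough picture of how these fields fit together, the sketch below defines a stand-in named tuple with the same field names and uses `padding_length` to trim a model output. The class name `MicroBatchItemSketch`, the helper `strip_batch_padding`, the use of lists instead of tensors, and the assumption that batch-level padding sits at the end of the packed output are all illustrative and not taken from the source.

```python
from typing import Any, NamedTuple, Optional


class MicroBatchItemSketch(NamedTuple):
    """Hypothetical stand-in with the documented field names."""

    orig_mb: dict[str, Any]
    padded_mb: dict[str, Any]
    padding_length: int
    old_cu_seqlens: Optional[list[int]]  # a Tensor in the real structure


def strip_batch_padding(output: list[float], padding_length: int) -> list[float]:
    """Assumption: padding was appended at the end of the packed output."""
    return output[: len(output) - padding_length]


item = MicroBatchItemSketch(
    orig_mb={"input_ids": [1, 2, 3]},
    padded_mb={"input_ids": [1, 2, 3, 0, 0]},
    padding_length=2,
    old_cu_seqlens=None,
)
model_output = [0.1, 0.2, 0.3, 0.0, 0.0]  # one value per padded token
trimmed = strip_batch_padding(model_output, item.padding_length)
```

The forward pass consumes `padded_mb`, while anything computed per real token is aligned with `orig_mb` once the padded tail is removed.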

Sources: areal/utils/data.py367-383 areal/utils/data.py417-443

Micro-Batch Splitting Algorithm

The core splitting logic is implemented in split_padded_tensor_dict_into_mb_list().

Algorithm Steps

Extract sequence lengths: Sum attention_mask along sequence dimension areal/utils/data.py495-498
Group by granularity: If granularity > 1, group adjacent sequences and sum their lengths areal/utils/data.py503-510
Bin packing: Use allocate_balanced_mbs_synced() with the selected packing_algorithm to assign groups to micro-batches areal/utils/data.py512-520
Flatten to sequence indices: Convert group assignments to per-sequence indices areal/utils/data.py521-527
Reorder tensors: Reorganize all tensors according to forward_indices areal/utils/data.py532-536
Split: Divide reordered tensors into separate micro-batch dictionaries areal/utils/data.py558-581
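The steps above can be sketched end to end on plain lists. This is an illustration under stated assumptions, not `split_padded_tensor_dict_into_mb_list()`: the function name `split_into_micro_batches` is hypothetical, the allocator in step 3 is a simple "longest group into the lightest micro-batch" stand-in rather than FFD or KK, and no cross-rank synchronization is performed.

```python
def split_into_micro_batches(attention_mask, granularity, n_mbs):
    """Return (per-micro-batch sequence indices, forward_indices)."""
    # Step 1: sequence lengths from the attention mask.
    seqlens = [sum(row) for row in attention_mask]
    if len(seqlens) % granularity != 0:
        raise ValueError("batch size must be divisible by granularity")
    # Step 2: sum the lengths of adjacent sequences within each group.
    group_lens = [
        sum(seqlens[i : i + granularity]) for i in range(0, len(seqlens), granularity)
    ]
    # Step 3 (stand-in allocator): longest group first, into the lightest micro-batch.
    group_bins = [[] for _ in range(n_mbs)]
    loads = [0] * n_mbs
    for g in sorted(range(len(group_lens)), key=lambda i: group_lens[i], reverse=True):
        target = loads.index(min(loads))
        group_bins[target].append(g)
        loads[target] += group_lens[g]
    # Step 4: expand group indices into per-sequence indices.
    mb_indices = [
        [g * granularity + j for g in groups for j in range(granularity)]
        for groups in group_bins
    ]
    # Step 5: forward_indices is the concatenation of all micro-batch index lists.
    forward_indices = [i for mb in mb_indices for i in mb]
    return mb_indices, forward_indices


lengths = [3, 1, 2, 2, 4, 4, 1, 1]
mask = [[1] * n + [0] * (4 - n) for n in lengths]
mb_indices, forward_indices = split_into_micro_batches(mask, granularity=2, n_mbs=2)
# Step 6: gather each micro-batch's rows by index.
micro_batches = [[mask[i] for i in idxs] for idxs in mb_indices]
```

Because groups are expanded only after allocation, the two sequences of any group always end up in the same micro-batch.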

Distributed synchronization: allocate_balanced_mbs_synced() ensures that all data parallel ranks agree on the number of micro-batches by all-gathering counts and taking the maximum areal/utils/data.py256-270. This prevents deadlocks in pipeline parallel training.

Sources: areal/utils/data.py256-270 areal/utils/data.py477-593

Padding and Alignment

After splitting, micro-batches must be padded to uniform lengths for efficient batch processing via pad_mb_list().

Padding Strategies

Strategy	When Used	Behavior
Dynamic padding	Default (`pad_to_maximum=False`)	Each micro-batch padded to its own max sequence length areal/utils/data.py703-706
Maximum padding	`pad_to_maximum=True`	All micro-batches padded to global maximum length areal/utils/data.py707-709
Sequence alignment	Context parallel enabled	Sequences aligned to multiples of `seq_align_to` (typically CP size) areal/utils/data.py711-715
Page alignment	Memory optimization	Optionally align to page boundaries (e.g. 256 tokens) areal/utils/data.py716-720
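The arithmetic behind these strategies is rounding a length up to a multiple and, optionally, taking a global maximum. The sketch below is an assumption-laden illustration: the helpers `align_up` and `padded_lengths` are hypothetical, each micro-batch is reduced to a single required length, and the way `pad_mb_list()` actually combines the strategies is not reproduced.

```python
def align_up(length: int, multiple: int) -> int:
    """Round length up to the nearest multiple (ceiling division)."""
    return -(-length // multiple) * multiple


def padded_lengths(mb_lengths, pad_to_maximum=False, seq_align_to=1):
    """Target length per micro-batch under dynamic or maximum padding."""
    aligned = [align_up(n, seq_align_to) for n in mb_lengths]
    if pad_to_maximum:
        return [max(aligned)] * len(aligned)  # every micro-batch uses the global max
    return aligned  # each micro-batch keeps its own target


mb_lengths = [1000, 700, 1030]
dynamic = padded_lengths(mb_lengths, seq_align_to=8)
maximum = padded_lengths(mb_lengths, pad_to_maximum=True, seq_align_to=8)
page_aligned = align_up(1030, 256)
```

Dynamic padding wastes fewer tokens, while maximum padding gives every micro-batch the same shape; aligning 1030 tokens to a 256-token page rounds up to 1280.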

Sources: areal/utils/data.py693-847

Configuration Examples

Example 1: GRPO with Group Granularity

Behavior: Groups every 4 consecutive sequences together (for GRPO group-based optimization) areal/api/cli_args.py109-114. Each group is treated as an indivisible unit during micro-batch assignment.
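A configuration matching this behavior might look like the following, written as a Python dict that mirrors the `mb_spec` fields. Only `granularity=4` is implied by the text; the `n_mbs` value and the batch size of 32 are hypothetical values chosen for illustration.

```python
# Hypothetical mb_spec overrides; only granularity=4 is implied by the text.
mb_spec = {"n_mbs": 2, "granularity": 4}

batch_size = 32  # assumed: 8 prompts with 4 sampled responses each
granularity = mb_spec["granularity"]
if batch_size % granularity != 0:
    raise ValueError("batch size must be divisible by granularity")

# Consecutive sequence indices that must stay in the same micro-batch.
groups = [
    list(range(start, start + granularity))
    for start in range(0, batch_size, granularity)
]
```

With these numbers the allocator sees 8 indivisible groups instead of 32 independent sequences, so all responses to one prompt are processed together.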

Example 2: Balanced Allocation with KK

Behavior: Uses the Karmarkar-Karp algorithm to distribute sequences into at least 4 micro-batches, aiming for near-optimal token balance across micro-batches to maximize hardware utilization areal/api/cli_args.py131-135
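A configuration matching this behavior might look like the following, again as a dict mirroring the `mb_spec` fields. The `max_tokens_per_mb` value of 8192 and the sequence lengths are hypothetical; the calculation shows only a lower bound on the micro-batch count implied by the token budget, not the count the allocator would actually produce.

```python
import math

# Hypothetical values: n_mbs=4 and "kk" follow the text, 8192 is an assumed budget.
mb_spec = {"n_mbs": 4, "max_tokens_per_mb": 8192, "packing_algorithm": "kk"}

seqlens = [6000, 5500, 5000, 4000, 3500, 3000, 2500, 2000, 1500, 1000]
# The token budget alone requires at least ceil(total / max_tokens_per_mb) micro-batches.
budget_lower_bound = math.ceil(sum(seqlens) / mb_spec["max_tokens_per_mb"])
min_count = max(mb_spec["n_mbs"], budget_lower_bound)
```

Here the 34,000 tokens cannot fit into 4 micro-batches of 8192 tokens, so at least 5 are needed even though `n_mbs` is 4, which illustrates why `n_mbs` is only a minimum once `max_tokens_per_mb` is set.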

Sources: areal/api/cli_args.py99-139 areal/utils/data.py477-593

Advanced Features

Multi-Modal Data Support

The micro-batching system supports multi-modal inputs (e.g., vision-language models) through special handling of multi_modal_input keys in split_padded_tensor_dict_into_mb_list() areal/utils/data.py566-577. For vision models, _prepare_multimodal_forward_inputs is used to manage large tensors like pixel_values between original and padded micro-batches areal/engine/fsdp_engine.py192-215

Normalization Configurations

Reward and advantage normalization can be configured via NormConfig, supporting batch-level or group-level statistics.

Parameter	Default	Description
`mean_level`	`"batch"`	Level for mean normalization (`batch`, `group`, or `None`). areal/api/cli_args.py46-52
`std_level`	`"batch"`	Level for std normalization (`batch`, `group`, or `None`). areal/api/cli_args.py57-63
`group_size`	`1`	Size of groups for group-level normalization. areal/api/cli_args.py76-78
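The difference between batch-level and group-level statistics can be sketched as follows. This is an illustration under stated assumptions, not AReaL's normalization code: the function names, the use of the population standard deviation, the `eps` value, and the exact formula `(x - mean) / (std + eps)` are all assumptions.

```python
from statistics import fmean, pstdev


def per_item_stat(values, level, group_size, fn):
    """Return one statistic per item at the requested level, or None if disabled."""
    if level is None:
        return [None] * len(values)
    if level == "batch":
        return [fn(values)] * len(values)
    if level == "group":
        out = []
        for start in range(0, len(values), group_size):
            chunk = values[start : start + group_size]
            out.extend([fn(chunk)] * len(chunk))
        return out
    raise ValueError(f"unknown level: {level!r}")


def normalize(values, mean_level="batch", std_level="batch", group_size=1, eps=1e-5):
    means = per_item_stat(values, mean_level, group_size, fmean)
    stds = per_item_stat(values, std_level, group_size, pstdev)
    result = []
    for v, m, s in zip(values, means, stds):
        centered = v - m if m is not None else v
        result.append(centered / (s + eps) if s is not None else centered)
    return result


rewards = [1.0, 0.0, 1.0, 1.0]  # two groups of two responses
centered = normalize(rewards, mean_level="group", std_level=None, group_size=2)
```

With group-level means and no std scaling, each reward is measured against the other responses in its own group, so a group where every response scored the same contributes zero signal.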

Sources: areal/api/cli_args.py43-97 areal/utils/data.py88-91 areal/engine/fsdp_engine.py192-215

URL: https://deepwiki.com/inclusionAI/AReaL/2.7-microbatchspec-and-data-configurations