Last indexed: 7 May 2026 (2e12c1)

Sequence Packing and Padding

Purpose and Scope

This page documents AReaL's sequence packing and padding utilities, which transform data between different tensor representations for efficient training and inference. These utilities handle:

Converting between padded format ([B, S, ...] with attention_mask) and packed format ([total_length, ...] with cu_seqlens).
Padding sequences to maximum lengths with alignment constraints.
Handling multi-modal inputs (images, videos) alongside text.
Memory-efficient operations that reduce fragmentation and enable context parallelism.
Support for specialized tree-based attention structures used in RL reasoning tasks.
Advanced Sequence Allocation: Support for greedy (FFD) and near-optimal (Karmarkar-Karp) algorithms for balancing workload across DP ranks.

For information about how these utilities are used in micro-batching, see MicroBatch System.


Data Format Representations

AReaL supports two primary tensor representations for sequence data:

Padded Format

The standard format where all sequences in a batch are padded to the same maximum length.

Component	Shape	Description
Tensors (e.g., `input_ids`)	`[B, S, ...]`	Batch size B, sequence length S.
`attention_mask`	`[B, S]`	Boolean mask: 1 for valid tokens, 0 for padding.
`position_ids`	`[B, S]`	Position indices for each token.

Packed Format

A memory-efficient format that removes padding by concatenating sequences end-to-end. This is the native format for Flash Attention and AReaL's internal training loop.

Component	Shape	Description
Tensors (e.g., `input_ids`)	`[total_length, ...]`	Concatenated sequences without padding.
`cu_seqlens`	`[B+1]`	Cumulative sequence lengths (starts with 0).
`max_seqlen`	scalar	Maximum sequence length in the batch.
`position_ids`	`[total_length]`	Position indices for each token.
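
As a concrete illustration of the packed-format fields, the sketch below derives them from a list of sequence lengths. It uses plain Python lists in place of tensors, the helper name `describe_packed` is hypothetical, and it assumes that position indices restart at 0 for every sequence, which this page does not state explicitly.

```python
from itertools import accumulate


def describe_packed(lengths):
    """Derive packed-format metadata from per-sequence lengths (illustrative only)."""
    # cu_seqlens has B+1 entries and starts with 0.
    cu_seqlens = [0, *accumulate(lengths)]
    # Assumption: positions restart at 0 at the beginning of each sequence.
    position_ids = [pos for n in lengths for pos in range(n)]
    return {
        "cu_seqlens": cu_seqlens,
        "max_seqlen": max(lengths, default=0),
        "position_ids": position_ids,
        "total_length": cu_seqlens[-1],
    }


meta = describe_packed([3, 1, 2])
# Sequence i occupies the slice [cu_seqlens[i], cu_seqlens[i + 1]) of any packed tensor.
second_sequence = slice(meta["cu_seqlens"][1], meta["cu_seqlens"][2])
```

For lengths 3, 1 and 2, `cu_seqlens` is `[0, 3, 4, 6]`, so the second sequence is the single token at packed index 3.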


Workload Balancing and Allocation

To minimize synchronization overhead in distributed training, AReaL provides algorithms to allocate variable-length sequences to different ranks. This prevents idle time where some ranks wait for others to finish processing longer sequences.

Supported Allocation Algorithms

The system selects an allocation function via get_allocate_fn based on the packing_algorithm configuration.

First Fit Decreasing (FFD): A greedy heuristic that sorts sequences by decreasing length and assigns each one to the first bin that still has capacity.
Karmarkar-Karp (KK): Also known as the Largest Differencing Method. It keeps partial partitions in a max-heap keyed on their imbalance and repeatedly merges the two most imbalanced ones so that their differences cancel. It produces near-optimal balance for large-scale RL training where sequence lengths vary significantly, and it is particularly recommended for bimodal sequence-length distributions and high DP parallelism.
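
The two heuristics can be sketched in their textbook form as follows. This is a minimal illustration, not AReaL's implementation: the function names `ffd_allocate` and `kk_partition`, the `capacity` parameter and the return format (lists of sequence indices) are all assumptions made for the example.

```python
import heapq
from itertools import count


def ffd_allocate(lengths, capacity):
    """First Fit Decreasing: bins of sequence indices, each holding <= capacity tokens."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins, loads = [], []
    for i in order:
        if lengths[i] > capacity:
            raise ValueError(f"sequence {i} is longer than the bin capacity")
        for b, load in enumerate(loads):
            if load + lengths[i] <= capacity:  # first bin with room wins
                bins[b].append(i)
                loads[b] += lengths[i]
                break
        else:  # no existing bin fits: open a new one
            bins.append([i])
            loads.append(lengths[i])
    return bins


def kk_partition(lengths, k):
    """Karmarkar-Karp (largest differencing): split indices into k balanced groups."""
    tie = count()  # unique tie-breaker so the heap never compares partitions
    heap = []
    for i, n in enumerate(lengths):
        # A partial partition: k slots sorted by descending sum.
        parts = [(n, [i])] + [(0, []) for _ in range(k - 1)]
        heapq.heappush(heap, (-n, next(tie), parts))  # max-heap on spread
    if not heap:
        return [[] for _ in range(k)]
    while len(heap) > 1:
        _, _, a = heapq.heappop(heap)
        _, _, b = heapq.heappop(heap)
        # Pair the heaviest slots of one partition with the lightest of the other.
        merged = [(sa + sb, ia + ib) for (sa, ia), (sb, ib) in zip(a, reversed(b))]
        merged.sort(key=lambda slot: slot[0], reverse=True)
        spread = merged[0][0] - merged[-1][0]
        heapq.heappush(heap, (-spread, next(tie), merged))
    return [indices for _, indices in heap[0][2]]


lengths = [8, 7, 6, 5, 4]
ffd_bins = ffd_allocate(lengths, capacity=15)
kk_groups = kk_partition(lengths, k=2)
kk_loads = sorted(sum(lengths[i] for i in group) for group in kk_groups)
```

On this input KK yields group loads of 14 and 16, while a perfect split (15 and 15) exists, which shows that the method is a heuristic rather than an exact solver. FFD needs a capacity and may open any number of bins, whereas KK fixes the number of groups up front, which maps more directly onto a fixed number of DP ranks.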

Redistribution Logic

During rollout, the DistRolloutCoordinator uses these algorithms to redistribute trajectories across the data-parallel group. It gathers trajectories from all ranks via all_gather_tensor_container, removes padding, and reallocates them to ranks for balanced training using the selected packing algorithm.

Diagram: Sequence Allocation and Redistribution


Packing and Unpacking Pipeline

Logic Flow: Natural Language to Packed Tensors

This diagram bridges the gap between raw text sequences and the packed representation produced by pack_tensor_dict and consumed by the training engines.

Diagram: Data Flow from Sequences to Packed Tensors


Packing Operations

`pack_tensor_dict`

The pack_tensor_dict function converts a dictionary of padded tensors into packed format. It is used in the data pipelines of all training backends to ensure memory efficiency.

Algorithm:

Extract sequence lengths from attention_mask: lens = attention_mask.sum(dim=1).
Compute cumulative sequence lengths: cu_seqlens = F.pad(torch.cumsum(lens, dim=0, dtype=torch.int32), (1, 0)).
For each tensor of shape [B, S, ...]:
- Create empty output tensor of shape [total_length, ...].
- Copy valid tokens from each sequence using unpad_input logic.
Replace attention_mask with cu_seqlens and max_seqlen.
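
The steps above can be sketched without tensors as follows. This is a simplified stand-in, assuming nested Python lists instead of `[B, S, ...]` tensors and a hypothetical helper name `pack_padded`; the real pack_tensor_dict operates on torch tensors and may treat some keys specially.

```python
from itertools import accumulate


def pack_padded(batch):
    """Pack a dict of padded [B, S] lists into flat [total_length] lists."""
    mask = batch["attention_mask"]
    # Step 1: per-sequence lengths from the mask (lens = attention_mask.sum(dim=1)).
    lens = [sum(row) for row in mask]
    # Step 2: cumulative lengths with a leading 0.
    cu_seqlens = [0, *accumulate(lens)]
    packed = {}
    for key, rows in batch.items():
        if key == "attention_mask":
            continue  # Step 4: the mask is replaced by cu_seqlens / max_seqlen
        # Step 3: keep only the positions where the mask is 1, in batch order.
        packed[key] = [
            tok
            for row, mask_row in zip(rows, mask)
            for tok, keep in zip(row, mask_row)
            if keep
        ]
    packed["cu_seqlens"] = cu_seqlens
    packed["max_seqlen"] = max(lens, default=0)
    return packed


padded = {
    "input_ids": [[5, 6, 7], [8, 0, 0]],
    "attention_mask": [[1, 1, 1], [1, 0, 0]],
}
packed = pack_padded(padded)
```

Because tokens are selected through the mask rather than by slicing a prefix, the same sketch also works for left-padded batches.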


Padding Operations

Padding Packed Tensors with Alignment

The pad_packed_tensor_dict function pads packed tensors with two levels of alignment. This is critical for Context Parallelism (Ulysses) and Flash Attention efficiency.

Diagram: Two-Stage Alignment and Padding Logic

Alignment Constraints:

Sequence-level (seq_align_to): Required for Ulysses sequence parallel alignment where sequences must be evenly divisible across ranks.
Batch-level: Aligns to DEFAULT_PAGE_SIZE_BYTES (mapped to N_TOKENS_PER_PAGE = 256) to match GPU memory page alignment and kernel requirements.
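
The arithmetic behind the two alignment levels can be sketched as below. This only computes lengths; the helper names `round_up` and `aligned_layout` are hypothetical, and how pad_packed_tensor_dict actually records the trailing batch-level padding (for example in cu_seqlens) is not covered here.

```python
def round_up(n, multiple):
    """Smallest multiple of `multiple` that is >= n."""
    return -(-n // multiple) * multiple


def aligned_layout(lengths, seq_align_to=1, n_tokens_per_page=256):
    """Return (aligned cu_seqlens, batch-level padding length) for the given lengths."""
    # Stage 1: pad every sequence up to a multiple of seq_align_to.
    aligned_lens = [round_up(n, seq_align_to) for n in lengths]
    cu_seqlens = [0]
    for n in aligned_lens:
        cu_seqlens.append(cu_seqlens[-1] + n)
    # Stage 2: pad the packed total up to a multiple of the page size in tokens.
    total = cu_seqlens[-1]
    padding_length = round_up(total, n_tokens_per_page) - total
    return cu_seqlens, padding_length


cu_seqlens, padding_length = aligned_layout([5, 9], seq_align_to=4)
```

For lengths 5 and 9 with seq_align_to=4, the per-sequence lengths become 8 and 12, and the 20-token total needs 236 more tokens to reach 256.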


Unpadding Operations

Unpadding Logits

The unpad_logits function removes padding from model outputs before loss computation.

Process:

Remove batch-level padding: Truncates the last padding_length tokens from the flattened logit tensor.
Remove sequence-level padding: If old_cu_seqlens is provided, it restores original boundaries by slicing out extra tokens added for seq_align_to.
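
A list-based sketch of the two steps follows. The function name `unpad_flat` and its signature are hypothetical, and the sketch assumes that sequence-level padding is appended at the end of each sequence; the actual unpad_logits works on logit tensors.

```python
def unpad_flat(flat, padding_length, cu_seqlens=None, old_cu_seqlens=None):
    """Strip batch-level, then sequence-level padding from a flat list of rows."""
    # Step 1: drop the batch-level padding appended after the last sequence.
    if padding_length > 0:
        flat = flat[:-padding_length]
    if old_cu_seqlens is None:
        return flat
    # Step 2: for each sequence, keep only its original number of tokens.
    # cu_seqlens holds the aligned boundaries, old_cu_seqlens the original ones.
    out = []
    for i in range(len(old_cu_seqlens) - 1):
        start = cu_seqlens[i]
        original_len = old_cu_seqlens[i + 1] - old_cu_seqlens[i]
        out.extend(flat[start:start + original_len])
    return out


# Two sequences of lengths 3 and 1, each aligned to 4 tokens, plus 2 batch-level pads.
flat_logits = [1, 2, 3, 0, 4, 0, 0, 0, 0, 0]
restored = unpad_flat(
    flat_logits,
    padding_length=2,
    cu_seqlens=[0, 4, 8],
    old_cu_seqlens=[0, 3, 4],
)
```

The order matters: batch-level padding sits after the last aligned sequence, so it has to be removed before the per-sequence boundaries are applied.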


Multi-modal Data Handling

Multi-modal inputs (e.g., images) are handled as pass-through lists within tensor dictionaries to avoid flattening spatial dimensions that aren't sequence-packed.

Logic Flow for Multi-modal Keys:

Detected via is_multi_modal_key(key), which checks for the prefix multi_modal_input.
During pack_tensor_dict, these keys are preserved as lists to be consumed by specific vision encoders.
During pad_sequences_to_tensors, they are concatenated as lists rather than stacked as tensors.
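
The pass-through behavior can be sketched as follows. Only the prefix check comes from the description above; the body of `is_multi_modal_key`, the helper `merge_samples` and the sample layout are assumptions made for illustration.

```python
MULTI_MODAL_PREFIX = "multi_modal_input"


def is_multi_modal_key(key):
    """Assumed implementation: a plain prefix check on the key name."""
    return key.startswith(MULTI_MODAL_PREFIX)


def merge_samples(samples):
    """Collect per-sample dicts into one batch dict (hypothetical helper)."""
    merged = {}
    for sample in samples:
        for key, value in sample.items():
            if is_multi_modal_key(key):
                # Multi-modal payloads are concatenated as lists, never stacked.
                merged.setdefault(key, []).extend(value)
            else:
                # Sequence data is collected per sample, to be padded or packed later.
                merged.setdefault(key, []).append(value)
    return merged


samples = [
    {"input_ids": [1, 2], "multi_modal_input": [{"pixel_values": "image-0"}]},
    {"input_ids": [3], "multi_modal_input": [{"pixel_values": "image-1"}]},
]
batch = merge_samples(samples)
```

Keeping the multi-modal entries as a list means images of different resolutions never have to share a tensor shape.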



URL: https://deepwiki.com/inclusionAI/AReaL/10.2-sequence-packing-and-padding