This page documents AReaL's sequence packing and padding utilities, which transform data between different tensor representations for efficient training and inference. These utilities handle conversion between the padded format ([B, S, ...] with attention_mask) and the packed format ([total_length, ...] with cu_seqlens).
For information about how these utilities are used in micro-batching, see MicroBatch System.
AReaL supports two primary tensor representations for sequence data:
Padded format: the standard representation, in which all sequences in a batch are padded to the same maximum length.
| Component | Shape | Description |
|---|---|---|
| Tensors (e.g., input_ids) | [B, S, ...] | Batch size B, sequence length S. |
| attention_mask | [B, S] | Boolean mask: 1 for valid tokens, 0 for padding. |
| position_ids | [B, S] | Position indices for each token. |
Packed format: a memory-efficient representation that removes padding by concatenating sequences end-to-end. This is the native format for Flash Attention and AReaL's internal training loop.
| Component | Shape | Description |
|---|---|---|
| Tensors (e.g., input_ids) | [total_length, ...] | Concatenated sequences without padding. |
| cu_seqlens | [B+1] | Cumulative sequence lengths (starts with 0). |
| max_seqlen | scalar | Maximum sequence length in the batch. |
| position_ids | [total_length] | Position indices for each token. |
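
To make the packed metadata concrete, the following minimal sketch derives cu_seqlens, max_seqlen, and position_ids from a list of per-sequence lengths. It uses plain Python lists in place of tensors so it runs without PyTorch. The helper name packed_metadata is hypothetical, and the choice to restart position indices at 0 for each sequence is an assumption, not something the table above states.

```python
from itertools import accumulate


def packed_metadata(seq_lens):
    """Derive packed-format metadata from per-sequence lengths (illustrative only)."""
    # Cumulative lengths with a leading 0, so len(cu_seqlens) == B + 1.
    cu_seqlens = [0] + list(accumulate(seq_lens))
    # Longest sequence in the batch.
    max_seqlen = max(seq_lens)
    # Assumption: positions restart at 0 at every sequence boundary.
    position_ids = [pos for n in seq_lens for pos in range(n)]
    return cu_seqlens, max_seqlen, position_ids


seq_lens = [3, 1, 2]
cu_seqlens, max_seqlen, position_ids = packed_metadata(seq_lens)

# Sequence i occupies packed[cu_seqlens[i]:cu_seqlens[i + 1]].
boundaries = list(zip(cu_seqlens, cu_seqlens[1:]))
```

With these lengths, the three sequences occupy the half-open ranges 0 to 3, 3 to 4, and 4 to 6 of the packed tensor.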
To minimize synchronization overhead in distributed training, AReaL provides algorithms to allocate variable-length sequences to different ranks. This prevents idle time where some ranks wait for others to finish processing longer sequences.
The system selects an allocation function via get_allocate_fn based on the packing_algorithm configuration.
During rollout, the DistRolloutCoordinator uses these algorithms to redistribute trajectories across the data-parallel group. It gathers trajectories from all ranks via all_gather_tensor_container, removes padding, and reallocates them to ranks for balanced training using the selected packing algorithm.
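
The text does not say which allocation strategies get_allocate_fn can return, so the following is only a sketch of one common balancing heuristic (greedy longest-first assignment), not AReaL's implementation. The function name allocate_balanced and its parameters are hypothetical.

```python
import heapq


def allocate_balanced(seq_lens, n_ranks):
    """Assign sequence indices to ranks, longest first, always to the lightest rank."""
    # Min-heap of (current token load, rank id).
    heap = [(0, rank) for rank in range(n_ranks)]
    heapq.heapify(heap)
    groups = [[] for _ in range(n_ranks)]
    # Visit sequences from longest to shortest.
    order = sorted(range(len(seq_lens)), key=lambda i: seq_lens[i], reverse=True)
    for idx in order:
        load, rank = heapq.heappop(heap)
        groups[rank].append(idx)
        heapq.heappush(heap, (load + seq_lens[idx], rank))
    return groups


seq_lens = [900, 120, 300, 450, 80, 610]
groups = allocate_balanced(seq_lens, n_ranks=2)
loads = [sum(seq_lens[i] for i in g) for g in groups]
```

For this input the two ranks receive 1200 and 1260 tokens, whereas splitting the list in its original order would give 1320 and 1140.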
[Diagram: Sequence Allocation and Redistribution]
This diagram bridges the gap between raw text sequences and the internal pack_tensor_dict representation used by training engines.
[Diagram: Data Flow from Sequences to Packed Tensors]
pack_tensor_dict
The pack_tensor_dict function converts a dictionary of padded tensors into packed format. It is used in the data pipelines of all training backends to ensure memory efficiency.
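Algorithm:
1. Compute sequence lengths from attention_mask: lens = attention_mask.sum(dim=1).
2. Build the cumulative lengths: cu_seqlens = F.pad(torch.cumsum(lens, dim=0, dtype=torch.int32), (1, 0)).
3. For each tensor of shape [B, S, ...]:
   - Flatten it to [total_length, ...].
   - Drop the padded positions using unpad_input logic.
4. Replace attention_mask with cu_seqlens and max_seqlen.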
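
The packing steps can be traced on a toy batch. This is a minimal sketch under stated assumptions, not the library function: nested Python lists stand in for [B, S] tensors, the helper name pack_padded_batch is hypothetical, and every non-mask entry is assumed to share the mask's shape.

```python
from itertools import accumulate


def pack_padded_batch(batch):
    """Pack padded [B, S] nested lists into flat [total_length] lists (illustrative)."""
    mask = batch["attention_mask"]
    # Equivalent of attention_mask.sum(dim=1).
    lens = [sum(row) for row in mask]
    # Equivalent of a cumulative sum padded with a leading 0.
    cu_seqlens = [0] + list(accumulate(lens))
    packed = {}
    for key, value in batch.items():
        if key == "attention_mask":
            continue  # the mask is replaced by cu_seqlens and max_seqlen
        # Keep only positions where the mask is 1, in row-major order.
        packed[key] = [
            tok
            for row, mask_row in zip(value, mask)
            for tok, keep in zip(row, mask_row)
            if keep
        ]
    packed["cu_seqlens"] = cu_seqlens
    packed["max_seqlen"] = max(lens)
    return packed


batch = {
    "input_ids": [[11, 12, 13], [21, 0, 0], [31, 32, 0]],
    "attention_mask": [[1, 1, 1], [1, 0, 0], [1, 1, 0]],
}
packed = pack_padded_batch(batch)
```

The nine padded slots shrink to six packed tokens, and the mask is no longer needed because cu_seqlens records where each sequence starts and ends.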
The pad_packed_tensor_dict function pads packed tensors with two levels of alignment. This is critical for Context Parallelism (Ulysses) and Flash Attention efficiency.
[Diagram: Two-Stage Alignment and Padding Logic]
Alignment Constraints:
- Sequence-level alignment (seq_align_to): Required for Ulysses sequence parallel alignment, where sequences must be evenly divisible across ranks.
- Batch-level alignment: The total length is aligned to DEFAULT_PAGE_SIZE_BYTES (mapped to N_TOKENS_PER_PAGE = 256) to match GPU memory page alignment and kernel requirements.
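
The arithmetic behind the two alignment levels can be sketched as follows. This is an illustration under assumptions rather than the actual pad_packed_tensor_dict logic: it assumes each sequence is first rounded up to a multiple of seq_align_to and the resulting total is then rounded up to a multiple of N_TOKENS_PER_PAGE. The helper names round_up and two_stage_padding are hypothetical.

```python
from itertools import accumulate

N_TOKENS_PER_PAGE = 256  # value quoted in the text above


def round_up(n, multiple):
    """Smallest multiple of `multiple` that is >= n."""
    return -(-n // multiple) * multiple


def two_stage_padding(cu_seqlens, seq_align_to, page=N_TOKENS_PER_PAGE):
    """Return (aligned cu_seqlens, padded total length, batch-level padding length)."""
    lens = [b - a for a, b in zip(cu_seqlens, cu_seqlens[1:])]
    # Stage 1 (assumed): pad every sequence to a multiple of seq_align_to.
    aligned_lens = [round_up(n, seq_align_to) for n in lens]
    aligned_cu = [0] + list(accumulate(aligned_lens))
    # Stage 2 (assumed): pad the whole packed batch to a page boundary.
    total = aligned_cu[-1]
    padded_total = round_up(total, page)
    return aligned_cu, padded_total, padded_total - total


aligned_cu, padded_total, padding_length = two_stage_padding([0, 3, 4, 6], seq_align_to=4)
```

Here three sequences of lengths 3, 1, and 2 each grow to 4 tokens, and the 12-token total is then padded by 244 tokens to reach one 256-token page.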
The unpad_logits function removes padding from model outputs before loss computation.
Process:
- Removes the padding_length padding tokens from the flattened logit tensor.
- If old_cu_seqlens is provided, it restores the original boundaries by slicing out the extra tokens added for seq_align_to.
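
A minimal sketch of the two removal steps, assuming the batch-level padding sits at the end of the flattened tensor and the seq_align_to padding sits at the end of each sequence. Neither layout is confirmed by the text. The function name unpad_logits_sketch and its argument order are hypothetical, and strings stand in for logit rows.

```python
def unpad_logits_sketch(logits, padding_length, cu_seqlens, old_cu_seqlens=None):
    """Strip batch-level padding, then per-sequence alignment padding (illustrative)."""
    # Step 1: drop the trailing batch-level padding tokens (assumed to be at the end).
    if padding_length > 0:
        logits = logits[:-padding_length]
    if old_cu_seqlens is None:
        return logits
    # Step 2: keep only the original tokens of each sequence.
    restored = []
    for i in range(len(old_cu_seqlens) - 1):
        start = cu_seqlens[i]  # start of sequence i in the aligned layout
        orig_len = old_cu_seqlens[i + 1] - old_cu_seqlens[i]
        restored.extend(logits[start:start + orig_len])
    return restored


# Two sequences of original lengths 3 and 1, aligned to multiples of 2,
# followed by 2 tokens of batch-level padding.
logits = ["a0", "a1", "a2", "pad", "b0", "pad", "PAD", "PAD"]
restored = unpad_logits_sketch(
    logits, padding_length=2, cu_seqlens=[0, 4, 6], old_cu_seqlens=[0, 3, 4]
)
trimmed_only = unpad_logits_sketch(logits, padding_length=2, cu_seqlens=[0, 4, 6])
```

The result lines up with the original cu_seqlens again, so per-token losses can be computed without masking out padded positions.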
Multi-modal inputs (e.g., images) are handled as pass-through lists within tensor dictionaries to avoid flattening spatial dimensions that aren't sequence-packed.
Logic Flow for Multi-modal Keys:
- Detection: is_multi_modal_key(key) checks for the prefix multi_modal_input.
- In pack_tensor_dict, these keys are preserved as lists to be consumed by specific vision encoders.
- In pad_sequences_to_tensors, they are concatenated as lists rather than stacked as tensors.
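
The pass-through behavior can be illustrated with a small merge routine. This is a sketch under assumptions, not AReaL's code: is_multi_modal_key_sketch mirrors only the prefix check described above, merge_samples is a hypothetical helper, and the per-image payload {"image": ...} is a made-up placeholder.

```python
MULTI_MODAL_PREFIX = "multi_modal_input"  # prefix named in the text above


def is_multi_modal_key_sketch(key):
    """Prefix check standing in for is_multi_modal_key."""
    return key.startswith(MULTI_MODAL_PREFIX)


def merge_samples(samples):
    """Merge per-sample dicts into one batch dict (illustrative)."""
    merged = {}
    for sample in samples:
        for key, value in sample.items():
            if is_multi_modal_key_sketch(key):
                # Multi-modal payloads are concatenated as plain lists.
                merged.setdefault(key, []).extend(value)
            else:
                # Sequence data is collected row by row, ready to be stacked.
                merged.setdefault(key, []).append(value)
    return merged


samples = [
    {"input_ids": [1, 2], "multi_modal_input": [{"image": "img0"}]},
    {"input_ids": [3, 4], "multi_modal_input": [{"image": "img1"}]},
]
merged = merge_samples(samples)
```

Keeping the multi-modal entries as a flat list leaves their internal shapes untouched, while the token rows remain available for stacking or packing.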