Last indexed: 7 May 2026 (2e12c1)

Data Processing and Utilities

This document describes AReaL's data processing utilities, which handle the transformation of training data from raw sequences to optimized micro-batches ready for distributed training. These utilities bridge the gap between data loading and model execution, implementing operations like sequence packing/unpacking, padding alignment, micro-batch splitting, and normalization.

Scope: This page covers the micro-batching system, sequence packing/padding strategies, normalization utilities, and HuggingFace integration. For training-specific data flow (tree attention packing), see Tree Training. For data loading and datasets, see Datasets and Reward Functions.

Overview

The data processing pipeline transforms data through several stages to optimize GPU memory and compute efficiency:

Data Transformation Pipeline

Sources: areal/utils/data.py:104-144, areal/utils/data.py:273-322

MicroBatch System

Core Data Structures

The micro-batching system organizes training data into manageable chunks for efficient memory usage and distributed processing. For details, see MicroBatch System.

Micro-Batch Class Hierarchy

MicroBatchItem (areal/utils/data.py:367-383) is a named tuple yielded during iteration; it contains both the original and the padded version of the micro-batch data.

MicroBatchList (areal/utils/data.py:385-472) is the main container. It stores the micro-batches and maintains forward and backward index mappings, so that sequences reordered during splitting can be mapped back to their original order.
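The role of the forward and backward index mappings can be illustrated with a small sketch. This is not the MicroBatchList implementation; the function name reorder_with_inverse and the use of plain Python lists instead of tensors are assumptions made for illustration.

```python
def reorder_with_inverse(items, forward_indices):
    """Reorder items by forward_indices and build the inverse permutation.

    Hypothetical helper: forward_indices[new_pos] = old_pos.
    """
    reordered = [items[old_pos] for old_pos in forward_indices]
    backward_indices = [0] * len(forward_indices)
    for new_pos, old_pos in enumerate(forward_indices):
        # backward_indices[old_pos] says where the original item ended up.
        backward_indices[old_pos] = new_pos
    return reordered, backward_indices


seqs = ["a", "bb", "ccc", "d"]
forward = [2, 0, 3, 1]  # an arbitrary reordering chosen for the example
reordered, backward = reorder_with_inverse(seqs, forward)

# Indexing the reordered list with the backward mapping restores the input order.
restored = [reordered[i] for i in backward]
```

Applying the backward mapping to per-sequence outputs (for example, losses or log-probabilities computed micro-batch by micro-batch) returns them in the order of the original batch.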

MicroBatchSpec (areal/api/cli_args.py:99-140) defines the splitting strategy, including constraints such as max_tokens_per_mb and granularity, and the packing_algorithm (e.g., ffd or kk).

Sources: areal/utils/data.py:367-472, areal/api/cli_args.py:99-140

Splitting and Allocation

The function split_padded_tensor_dict_into_mb_list (areal/utils/data.py:477-594) splits a padded batch into micro-batches. It groups sequences into micro-batches whose token counts are balanced across GPUs. Engines such as FSDPEngine and MegatronEngine use these utilities to feed data into their forward and backward passes.
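As a rough illustration of splitting under a token budget, the sketch below groups sequence indices with a First Fit Decreasing pass so that no group exceeds max_tokens_per_mb. It is a simplified stand-in, not the body of split_padded_tensor_dict_into_mb_list; the function name ffd_split and the error handling are assumptions.

```python
def ffd_split(seq_lens, max_tokens_per_mb):
    """Group sequence indices into micro-batches with First Fit Decreasing.

    Hypothetical helper: returns (groups, loads), where groups[i] lists the
    sequence indices in micro-batch i and loads[i] is its total token count.
    """
    # Visit sequences from longest to shortest.
    order = sorted(range(len(seq_lens)), key=lambda i: seq_lens[i], reverse=True)
    groups, loads = [], []
    for i in order:
        n = seq_lens[i]
        if n > max_tokens_per_mb:
            raise ValueError(f"sequence {i} has {n} tokens, above the budget")
        for b, load in enumerate(loads):
            if load + n <= max_tokens_per_mb:
                # First existing micro-batch with enough room.
                groups[b].append(i)
                loads[b] += n
                break
        else:
            # No micro-batch had room, so open a new one.
            groups.append([i])
            loads.append(n)
    return groups, loads


seq_lens = [700, 300, 512, 200, 100, 900]
groups, loads = ffd_split(seq_lens, max_tokens_per_mb=1024)
# groups == [[5, 4], [0, 1], [2, 3]], loads == [1000, 1000, 712]
```

The flattened group order plays the role of the forward index mapping: it records which original sequence sits at each position after splitting.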

Sources: areal/utils/data.py:477-594

Sequence Packing and Padding

Packing: From Padded to Packed Format

pack_tensor_dict (areal/utils/data.py:273-322) converts tensors from the padded [B, S, ...] format to the packed [total_length, ...] format. Packing removes the padding tokens within a batch, which reduces computation and memory overhead for variable-length sequences. For details, see Sequence Packing and Padding.
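The padded-to-packed conversion can be sketched with plain Python lists standing in for tensors. This is a minimal illustration rather than the pack_tensor_dict implementation; the function name pack_padded and the returned max_seqlen value are assumptions, while cu_seqlens (cumulative sequence lengths) is the boundary array that packed layouts commonly carry.

```python
from itertools import accumulate


def pack_padded(input_ids, attention_mask):
    """Flatten a padded [B, S] batch into a packed [total_length] list.

    Hypothetical helper: returns (packed, cu_seqlens, max_seqlen).
    """
    seq_lens = [sum(mask_row) for mask_row in attention_mask]
    # Keep only the tokens whose mask entry is 1.
    packed = [
        tok
        for ids_row, mask_row in zip(input_ids, attention_mask)
        for tok, keep in zip(ids_row, mask_row)
        if keep
    ]
    # cu_seqlens[i]:cu_seqlens[i + 1] is the slice of sequence i in packed.
    cu_seqlens = [0] + list(accumulate(seq_lens))
    return packed, cu_seqlens, max(seq_lens, default=0)


input_ids = [[11, 12, 13, 0], [21, 22, 0, 0], [31, 32, 33, 34]]
attention_mask = [[1, 1, 1, 0], [1, 1, 0, 0], [1, 1, 1, 1]]
packed, cu_seqlens, max_seqlen = pack_padded(input_ids, attention_mask)
# packed has 9 entries instead of the 12 slots of the padded batch.
```

In this example the padded batch occupies 3 x 4 = 12 slots, while the packed form holds only the 9 real tokens plus a boundary array of length B + 1.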

Sequence Packing Logic

Sources: areal/utils/data.py:273-322

Load Balancing with KK and FFD

AReaL supports configurable sequence packing algorithms to improve load balancing across data-parallel ranks.

First Fit Decreasing (FFD): A greedy heuristic that is fast but may produce imbalanced groups when sequence lengths follow a bimodal distribution (areal/utils/seqpack.py:196-203).
Karmarkar-Karp (KK): The "Largest Differencing Method", which produces near-optimal balance for large-scale RL training (areal/utils/seqpack.py:163-164).
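To make the Karmarkar-Karp idea concrete, the following sketch partitions sequence lengths into k groups by repeatedly merging the two partial partitions with the largest spread, pairing the heaviest part of one with the lightest part of the other. It is a textbook-style illustration and not the code in areal/utils/seqpack.py; the function name kk_partition and its return format are assumptions.

```python
import heapq


def kk_partition(seq_lens, k):
    """Split sequence indices into k groups with the largest differencing method.

    Hypothetical helper: returns a list of k index lists.
    """
    if not seq_lens:
        return [[] for _ in range(k)]
    heap = []
    for idx, n in enumerate(seq_lens):
        # Each state holds k (sum, indices) parts, sorted by descending sum.
        parts = [(n, [idx])] + [(0, []) for _ in range(k - 1)]
        # Max-heap on spread (largest sum minus smallest sum) via negation;
        # the unique second field keeps tuple comparison away from the lists.
        heapq.heappush(heap, (-n, idx, parts))
    tiebreak = len(seq_lens)
    while len(heap) > 1:
        _, _, a = heapq.heappop(heap)
        _, _, b = heapq.heappop(heap)
        # Pair the heaviest part of a with the lightest part of b, and so on.
        merged = [(sa + sb, ia + ib) for (sa, ia), (sb, ib) in zip(a, reversed(b))]
        merged.sort(key=lambda part: part[0], reverse=True)
        spread = merged[0][0] - merged[-1][0]
        heapq.heappush(heap, (-spread, tiebreak, merged))
        tiebreak += 1
    return [indices for _, indices in heap[0][2]]


seq_lens = [10, 9, 8, 7, 6, 5, 4, 3]
groups = kk_partition(seq_lens, k=2)
loads = [sum(seq_lens[i] for i in group) for group in groups]
# Both groups carry 26 tokens for this input.
```

Unlike FFD, this procedure fixes the number of groups up front and minimizes the difference between them, which is the property that matters when each group is assigned to one data-parallel rank.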

Sources: areal/utils/seqpack.py:161-203

Alignment and Memory Optimization

pad_mb_list (areal/utils/data.py:755-817) pads packed tensors so that their lengths align with GPU memory page boundaries (default: 256 tokens). This reduces GPU memory fragmentation and ensures compatibility with context parallelism strategies such as Ulysses.
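The alignment step amounts to rounding the packed length up to the next multiple of the alignment size. The sketch below shows that arithmetic on plain lists; it is not the pad_mb_list implementation. The function names, the pad_id parameter, and the choice to record the padding as one extra trailing entry in cu_seqlens are all assumptions made for illustration.

```python
def pad_to_multiple(total_tokens, align=256):
    """Return (padded_length, pad_amount) for a packed length.

    Hypothetical helper using ceiling division to round up.
    """
    padded = -(-total_tokens // align) * align
    return padded, padded - total_tokens


def pad_packed(packed, cu_seqlens, align=256, pad_id=0):
    """Pad a packed token list to an aligned length.

    Assumption: padding is tracked as a dummy sequence appended to cu_seqlens,
    so the real sequence boundaries stay unchanged.
    """
    padded_len, pad = pad_to_multiple(len(packed), align)
    new_packed = packed + [pad_id] * pad
    new_cu = cu_seqlens + [padded_len] if pad > 0 else list(cu_seqlens)
    return new_packed, new_cu


aligned = pad_to_multiple(1000)        # 1000 tokens round up to 1024
already_aligned = pad_to_multiple(1024)  # no padding needed

packed = [11, 12, 13, 21, 22, 31, 32, 33, 34]
padded_tokens, padded_cu = pad_packed(packed, [0, 3, 5, 9], align=8)
```

With the default alignment of 256, a micro-batch of 1000 tokens grows by 24 padding tokens, which is small compared with the padding removed by packing.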

Sources: areal/utils/data.py:755-817

Normalization and Estimation

Adaptive Normalization

The Normalization class (areal/utils/data.py:1152-1383) provides flexible normalization for rewards and advantages. It supports different levels (batch, group) and RL-specific techniques such as the leave-one-out mean (areal/utils/data.py:1315-1336), which excludes each sample's own reward from its baseline to reduce bias. For details, see Normalization and Estimation.
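A group-level leave-one-out mean can be sketched as follows: each reward is centered by the mean of the other rewards in its group. This is an illustrative sketch, not the Normalization class; the function name is hypothetical, the rewards are assumed to be laid out group by group, and standard-deviation scaling is left out.

```python
def leave_one_out_center(rewards, group_size):
    """Center each reward by the mean of the other rewards in its group.

    Hypothetical helper: rewards is a flat list in which consecutive blocks of
    group_size entries belong to the same prompt.
    """
    if group_size < 2:
        raise ValueError("leave-one-out needs at least two samples per group")
    if len(rewards) % group_size != 0:
        raise ValueError("len(rewards) must be a multiple of group_size")
    centered = []
    for start in range(0, len(rewards), group_size):
        group = rewards[start:start + group_size]
        total = sum(group)
        for r in group:
            # Baseline is the mean over the group with this sample removed.
            baseline = (total - r) / (group_size - 1)
            centered.append(r - baseline)
    return centered


rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 1.0]
advantages = leave_one_out_center(rewards, group_size=3)
# advantages == [1.0, -0.5, -0.5, 0.0, 0.0, 0.0]
```

Because a sample's baseline never includes its own reward, the baseline is independent of that sample, which is the source of the bias reduction compared with subtracting the full group mean.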

NormConfig (areal/api/cli_args.py:42-95) controls these behaviors through fields such as mean_level, std_level, and group_size.

Sources: areal/utils/data.py:1152-1383, areal/api/cli_args.py:42-95

KL Divergence Estimation

KLEstimator (areal/utils/data.py:1385-1443) computes an approximate KL divergence between the current policy and a reference model. It supports multiple estimator types (k1, k2, k3), which trade off bias, variance, and guaranteed non-negativity.
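The sketch below uses the widely cited k1/k2/k3 definitions for per-sample KL estimates, with samples drawn from the current policy. Whether KLEstimator matches these formulas exactly is an assumption; the function name kl_estimate and its arguments are hypothetical.

```python
import math


def kl_estimate(logp, ref_logp, kind="k3"):
    """Per-token estimate of KL(current || reference) from log-probabilities.

    Hypothetical helper. With log_ratio = log(p_ref / p_cur):
      k1 = -log_ratio                 (unbiased, can be negative)
      k2 = log_ratio ** 2 / 2         (biased, never negative)
      k3 = exp(log_ratio) - 1 - log_ratio  (unbiased, never negative)
    """
    log_ratio = ref_logp - logp
    if kind == "k1":
        return -log_ratio
    if kind == "k2":
        return 0.5 * log_ratio ** 2
    if kind == "k3":
        # expm1 keeps precision when log_ratio is close to zero.
        return math.expm1(log_ratio) - log_ratio
    raise ValueError(f"unknown estimator type: {kind}")


# A token that the reference model finds more likely than the current policy.
k1 = kl_estimate(logp=-1.5, ref_logp=-1.0, kind="k1")  # -0.5, negative
k2 = kl_estimate(logp=-1.5, ref_logp=-1.0, kind="k2")  # 0.125
k3 = kl_estimate(logp=-1.5, ref_logp=-1.0, kind="k3")  # about 0.1487
```

The example shows why the choice matters: k1 goes negative on individual tokens even though the true KL divergence is non-negative, whereas k2 and k3 stay non-negative for every sample.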

Sources: areal/utils/data.py:1385-1443

HuggingFace Utilities

AReaL provides wrappers for loading and configuring HuggingFace components. This includes the instantiation of tokenizers and multi-modal processors for vision-language tasks. For details, see HuggingFace Utilities.

Sources: areal/utils/hf_utils.py:121

Datasets and Reward Functions

AReaL includes built-in support for several common RL alignment datasets, providing specialized loaders for both SFT and RL phases:

GSM8K: Math reasoning (areal/dataset/gsm8k.py:6-63)
CLEVR: Visual counting with image processing (areal/dataset/clevr_count_70k.py:38-206)
Geometry3K: Geometric reasoning (areal/dataset/geometry3k.py:44-206)
HH-RLHF: Preference datasets for reward modeling (areal/dataset/hhrlhf.py:6-29)
ToRL: Multi-source reasoning data (areal/dataset/torl_data.py:62-100)

For details on dataset implementation and custom reward functions, see Datasets and Reward Functions.

Sources: areal/dataset/gsm8k.py:1-63, areal/dataset/clevr_count_70k.py:1-206, areal/dataset/geometry3k.py:1-206, areal/dataset/hhrlhf.py:1-29, areal/dataset/torl_data.py:1-100

Tree Training

For efficient RL on sequences with common prefixes (e.g., multi-sample rollouts from the same prompt), AReaL supports Tree Training. This involves specialized structures to eliminate redundant prefix computations.

This system is integrated across backends, including FSDPEngine and ArchonEngine. For details, see Tree Training.

Sources: areal/engine/fsdp_engine.py:100-104

Distributed Data Service

For distributed data loading in large-scale clusters, AReaL provides a DataController and RDataset architecture. This system manages remote dataset proxies, prefetch buffers, and a microservice-based data delivery pipeline (DataWorker, Gateway, Router). For details, see Distributed Data Service.

Sources: areal/infra/data_service/controller.py:1-50

Additional Utilities

Batch Size Inference

get_batch_size (areal/utils/data.py:26-46) infers the batch size from a dictionary of tensors by checking keys such as attention_mask and cu_seqlens, or else the first dimension of any tensor present.
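The lookup order described above can be sketched with lists standing in for tensors. This is not the get_batch_size implementation: the function name infer_batch_size, the exact priority of the keys, and raising an error when nothing usable is found are assumptions of this sketch.

```python
def infer_batch_size(data):
    """Infer the number of sequences from a dict of list-like fields.

    Hypothetical helper; lists stand in for tensors.
    """
    if "attention_mask" in data:
        # Padded layout: one mask row per sequence.
        return len(data["attention_mask"])
    if "cu_seqlens" in data:
        # Packed layout: B sequences give B + 1 cumulative boundaries.
        return len(data["cu_seqlens"]) - 1
    for value in data.values():
        if hasattr(value, "__len__"):
            # Fall back to the first dimension of any sized field.
            return len(value)
    raise ValueError("cannot infer batch size from the given fields")


padded = {"attention_mask": [[1, 1, 0], [1, 0, 0], [1, 1, 1]]}
packed = {"input_ids": list(range(9)), "cu_seqlens": [0, 3, 5, 9]}
other = {"rewards": [0.1, 0.2, 0.3, 0.4]}

sizes = [infer_batch_size(d) for d in (padded, packed, other)]
# sizes == [3, 3, 4]
```

The packed case shows why cu_seqlens has to be checked before the generic fallback: the first dimension of input_ids there is the total token count (9), not the number of sequences (3).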

Distributed Data Operations

AReaL includes utilities for synchronizing data across ranks:

broadcast_tensor (areal/utils/data.py:954-975): Broadcasts a single tensor across a process group.
all_gather_tensor_container (areal/utils/data.py:977-1022): Gathers data from all ranks, padding tensors whose shapes differ between ranks.

Recovery and Checkpointing

Utilities in areal/utils/recover.py manage the state required to resume training after an interruption. This includes the RecoverInfo class (areal/utils/recover.py:41-50), which serializes step information, dataloader states, and evaluator info. The Saver class (areal/utils/saver.py:23-33) manages both synchronous and asynchronous checkpointing (specifically for ArchonEngine).

Sources: areal/utils/data.py:26-46, areal/utils/data.py:954-1022, areal/utils/recover.py:41-150, areal/utils/saver.py:23-191


URL: https://deepwiki.com/inclusionAI/AReaL/10-data-processing-and-utilities