
URL: https://deepwiki.com/inclusionAI/AReaL/11.3-checkpoint-metadata

Last indexed: 7 May 2026 (2e12c1)

Checkpoint Metadata

This page documents the core data structures used for checkpoint-related metadata in AReaL. These structures capture information about training schedules (FinetuneSpec), individual parameter specifications (ParamSpec), and checkpoint save/load operations (SaveLoadMeta). It also covers the RecoverInfo structure used by the recovery system to resume training and handle state persistence across distributed ranks.

For information about weight synchronization between training and inference engines, see Weight Update Metadata. For information about the actual checkpoint saving and recovery implementation, see Checkpointing and Recovery.

Overview

AReaL's checkpoint metadata system consists of several primary dataclasses that serve distinct purposes in the checkpoint lifecycle:

| Structure | Purpose | Primary Use Case |
|---|---|---|
| FinetuneSpec | Training schedule specification | Calculating total training steps and epoch boundaries |
| ParamSpec | Parameter metadata | Describing individual parameter properties and memory requirements |
| SaveLoadMeta | Checkpoint I/O configuration | Coordinating model save/load operations with tokenizer and processor |
| RecoverInfo | Training state persistence | Bundling step info, dataloader states, and component metadata for recovery |

Sources: areal/api/io_struct.py:132-296, areal/utils/recover.py:35-46

FinetuneSpec

Structure Definition

The FinetuneSpec dataclass holds three training-schedule fields (total_train_epochs, dataset_size, and train_batch_size), from which training step counts and epoch boundaries are derived (areal/api/io_struct.py:132-136).
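As a minimal sketch, assuming only the three fields named in the summary table and the formulas listed under Computed Properties, the structure might look like the following. It is an illustrative re-creation, not the code in io_struct.py.

```python
from dataclasses import dataclass


@dataclass
class FinetuneSpec:
    """Illustrative re-creation; field names follow the summary table on this page."""

    total_train_epochs: int
    dataset_size: int
    train_batch_size: int

    @property
    def steps_per_epoch(self) -> int:
        # Floor division drops the final partial batch (drop_last=True).
        return self.dataset_size // self.train_batch_size

    @property
    def total_train_steps(self) -> int:
        return self.total_train_epochs * self.steps_per_epoch


spec = FinetuneSpec(total_train_epochs=3, dataset_size=1000, train_batch_size=64)
print(spec.steps_per_epoch)    # 15, because 1000 // 64 == 15 (40 samples dropped per epoch)
print(spec.total_train_steps)  # 45
```

With a dataset of 1000 samples and a batch size of 64, the last 40 samples of each epoch never form a full batch, which is why the step count is 15 rather than 16.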


Computed Properties

FinetuneSpec provides two computed properties that derive training schedule information:

| Property | Formula | Description |
|---|---|---|
| total_train_steps | total_train_epochs × (dataset_size // train_batch_size) | Total number of optimizer steps across all epochs (assumes drop_last=True) |
| steps_per_epoch | dataset_size // train_batch_size | Number of optimizer steps per epoch |

Sources: areal/api/io_struct.py:137-145

Usage in Training Loop

FinetuneSpec is used to initialize learning rate schedulers and compute epoch boundaries. Because the step formulas assume drop_last=True, every training step sees a full batch of train_batch_size samples. The StatsLogger also uses it to log progress relative to the total training duration (areal/utils/stats_logger.py:138-142).
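The sketch below illustrates how a training loop might consume the two totals: locating a global step within its epoch, and deriving a learning rate from the total step count. The helpers locate_step and lr_at, the warmup ratio, and the cosine schedule are all hypothetical choices made for illustration; they are not AReaL's scheduler.

```python
import math


def locate_step(global_step: int, steps_per_epoch: int) -> tuple[int, int]:
    """Map a 0-based global step to (epoch, step within that epoch)."""
    return divmod(global_step, steps_per_epoch)


def lr_at(step: int, total_steps: int, base_lr: float, warmup_ratio: float = 0.1) -> float:
    """Linear warmup followed by cosine decay over total_steps (hypothetical schedule)."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# Values a FinetuneSpec would supply (3 epochs of 15 steps each).
steps_per_epoch, total_train_steps = 15, 45

print(locate_step(31, steps_per_epoch))   # (2, 1): third epoch, second step
print(lr_at(0, total_train_steps, 1e-4))  # learning rate at the first warmup step
```

Knowing total_train_steps up front is what lets the decay phase end exactly at the final optimizer step.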

[Diagram: FinetuneSpec Data Flow]

Sources: areal/api/io_struct.py:132-145, areal/utils/recover.py:150-155, areal/utils/stats_logger.py:138-142

ParamSpec

Structure Definition

The ParamSpec dataclass describes an individual model parameter through its name, shape, and dtype (areal/api/io_struct.py:147-152).


Memory Size Calculation

ParamSpec provides a size property that computes the parameter's memory footprint in bytes. The calculation multiplies the torch dtype's itemsize by the product of the tensor shape (areal/api/io_struct.py:153-156).
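A minimal sketch of this calculation follows. To stay free of a torch dependency it substitutes a hypothetical DTYPE_ITEMSIZE lookup table for the torch dtype's itemsize, and it assumes the dtype is stored as a string name; both are assumptions for illustration.

```python
import math
from dataclasses import dataclass

# Hypothetical byte widths; the original reads itemsize from the torch dtype instead.
DTYPE_ITEMSIZE = {"float32": 4, "bfloat16": 2, "float16": 2, "float8_e4m3fn": 1}


@dataclass
class ParamSpec:
    name: str
    shape: tuple[int, ...]
    dtype: str  # dtype name such as "bfloat16" (assumed representation)

    @property
    def size(self) -> int:
        """Memory footprint in bytes: itemsize times the number of elements."""
        return DTYPE_ITEMSIZE[self.dtype] * math.prod(self.shape)


param = ParamSpec(name="lm_head.weight", shape=(4096, 1024), dtype="bfloat16")
print(param.size)  # 8388608 bytes, i.e. 8 MiB
```

Summing size over a list of ParamSpec entries gives the buffer size needed to hold a group of parameters.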

SaveLoadMeta

Structure Definition

The SaveLoadMeta dataclass carries all information needed for checkpoint save and load operations (areal/api/io_struct.py:288-296).


Field Descriptions

| Field | Type | Description |
|---|---|---|
| path | str | Filesystem path where checkpoint is saved or loaded |
| weight_format | str | Format specification (e.g., "safetensors", "torch", "hf") (areal/utils/saver.py:152) |
| with_optim | bool | Whether to include optimizer states in checkpoint |
| tokenizer | PreTrainedTokenizerFast \| None | Tokenizer instance to save/load alongside weights |
| processor | AutoProcessor \| None | Processor for vision-language models (e.g. Qwen2-VL) |
| base_model_path | str \| None | Path to base model (used for LoRA adapter checkpoints) |
| naive_distributed | bool | Whether to use naive distributed saving (all ranks save full state) |

Sources: areal/api/io_struct.py:288-296, areal/utils/saver.py:150-158
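The fields in the table can be sketched as a dataclass. This is an illustration under stated assumptions: the tokenizer and processor types are loosened to Any so the snippet runs without the transformers package, and the default values for base_model_path and naive_distributed are assumed rather than taken from the source.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SaveLoadMeta:
    path: str
    weight_format: str
    with_optim: bool
    tokenizer: Optional[Any]  # PreTrainedTokenizerFast in the table above
    processor: Optional[Any]  # AutoProcessor in the table above
    base_model_path: Optional[str] = None  # assumed default
    naive_distributed: bool = False        # assumed default


meta = SaveLoadMeta(
    path="/tmp/ckpt/step_100",  # hypothetical checkpoint directory
    weight_format="hf",
    with_optim=False,
    tokenizer=None,
    processor=None,
)
print(meta.base_model_path, meta.naive_distributed)  # None False
```

Bundling these options in one object means a save or load call needs a single argument rather than a long parameter list.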

Checkpoint Operation Flow

[Diagram: SaveLoadMeta Checkpoint Lifecycle]

Sources: areal/api/io_struct.py:288-296, areal/utils/saver.py:122-159

Recovery Metadata

RecoverInfo Structure

The RecoverInfo dataclass persists the complete state of a training trial, including component-specific metadata and dataloader states (areal/utils/recover.py:36-45).


Persistence and Serialization

The RecoverInfo.dump method handles serialization. In distributed environments, it calls all_gather_object to collect dataloader_info from all ranks so that each rank's StatefulDataLoader can resume exactly where it left off (areal/utils/recover.py:52-62).

The structure is split into multiple files in the recover_info directory (areal/utils/recover.py:65-94):

  • step_info.json: Contains the StepInfo (epoch, steps).
  • dataloader_info.pkl: Pickled state of the StatefulDataLoader (rank-specific in distributed mode).
  • saver_info.json, evaluator_info.json, stats_logger_info.json, checkpoint_info.json: Component metadata.

Sources: areal/utils/recover.py:47-94
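The file layout above can be sketched as a single-process round trip. This is a simplified illustration, not the code in recover.py: it omits the all_gather_object step, the StepInfo field names (epoch, epoch_step, global_step) are assumptions, and the load classmethod is a hypothetical counterpart added to show that the files restore the same state.

```python
import json
import pickle
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path

JSON_COMPONENTS = ("saver_info", "evaluator_info", "stats_logger_info", "checkpoint_info")


@dataclass
class StepInfo:  # field names are assumptions
    epoch: int = 0
    epoch_step: int = 0
    global_step: int = 0


@dataclass
class RecoverInfo:
    last_step_info: StepInfo = field(default_factory=StepInfo)
    saver_info: dict = field(default_factory=dict)
    evaluator_info: dict = field(default_factory=dict)
    stats_logger_info: dict = field(default_factory=dict)
    checkpoint_info: dict = field(default_factory=dict)
    dataloader_info: dict = field(default_factory=dict)

    def dump(self, root: Path) -> None:
        """Write one file per component under root/recover_info."""
        out = root / "recover_info"
        out.mkdir(parents=True, exist_ok=True)
        (out / "step_info.json").write_text(json.dumps(asdict(self.last_step_info)))
        for name in JSON_COMPONENTS:
            (out / f"{name}.json").write_text(json.dumps(getattr(self, name)))
        # Dataloader state may hold objects that JSON cannot encode, hence pickle.
        with open(out / "dataloader_info.pkl", "wb") as f:
            pickle.dump(self.dataloader_info, f)

    @classmethod
    def load(cls, root: Path) -> "RecoverInfo":
        """Hypothetical inverse of dump."""
        out = root / "recover_info"
        info = cls(last_step_info=StepInfo(**json.loads((out / "step_info.json").read_text())))
        for name in JSON_COMPONENTS:
            setattr(info, name, json.loads((out / f"{name}.json").read_text()))
        with open(out / "dataloader_info.pkl", "rb") as f:
            info.dataloader_info = pickle.load(f)
        return info


info = RecoverInfo(
    last_step_info=StepInfo(epoch=2, epoch_step=1, global_step=31),
    saver_info={"last_saved_step": 30},
    dataloader_info={"samples_yielded": 1984},
)
with tempfile.TemporaryDirectory() as tmp:
    info.dump(Path(tmp))
    written = sorted(p.name for p in (Path(tmp) / "recover_info").iterdir())
    restored = RecoverInfo.load(Path(tmp))

print(written)
print(restored == info)  # True
```

Keeping the step and component metadata in JSON leaves them human-readable, while only the dataloader state needs pickle.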

FP8 Checkpoint Metadata

For models using FP8 quantization (e.g., via ArchonEngine), additional metadata is required to handle blockwise scaling factors.

FP8 Detection and Preparation

The system detects FP8 checkpoints by searching for *_scale_inv keys in the model.safetensors.index.json file (areal/experimental/models/archon/fp8_checkpoint.py:34-44). Before these checkpoints are loaded via Distributed Checkpoint (DCP), the hf_state_dict must be mutated: placeholder dtypes change from BF16 to float8_e4m3fn, and float32 placeholders are inserted for the inverse scales (areal/experimental/models/archon/fp8_checkpoint.py:47-116).
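The detection step can be sketched with the standard library alone. The function name is_fp8_checkpoint and the tensor names in the example are hypothetical; the sketch assumes the usual Hugging Face index layout, in which a weight_map object maps tensor names to shard files.

```python
import json
import tempfile
from pathlib import Path


def is_fp8_checkpoint(checkpoint_dir: str) -> bool:
    """Return True if the safetensors index lists any *_scale_inv tensor."""
    index_path = Path(checkpoint_dir) / "model.safetensors.index.json"
    if not index_path.is_file():
        return False  # no sharded index to inspect
    weight_map = json.loads(index_path.read_text()).get("weight_map", {})
    return any(key.endswith("_scale_inv") for key in weight_map)


with tempfile.TemporaryDirectory() as ckpt_dir:
    index = {
        "metadata": {},
        "weight_map": {
            # Hypothetical tensor names for illustration.
            "model.layers.0.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
            "model.layers.0.mlp.down_proj.weight_scale_inv": "model-00001-of-00002.safetensors",
        },
    }
    (Path(ckpt_dir) / "model.safetensors.index.json").write_text(json.dumps(index))
    detected = is_fp8_checkpoint(ckpt_dir)

with tempfile.TemporaryDirectory() as empty_dir:
    detected_empty = is_fp8_checkpoint(empty_dir)

print(detected, detected_empty)  # True False
```

Reading only the index file keeps detection cheap, since no tensor data has to be opened.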

Sharded Dequantization

When loading FP8 weights into an FSDP-sharded model, the system dequantizes each shard locally. For Shard(0) placements, it calculates the global_offset and slices the scale_inv tensor to match the local shard's row range (areal/experimental/models/archon/fp8_checkpoint.py:152-196).
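The row-range arithmetic can be sketched in plain Python. The sketch assumes square scaling blocks of block_size rows by block_size columns and that a dequantized value is the stored value times its block's scale_inv entry. Nested lists stand in for tensors, and both helper names are hypothetical.

```python
def scale_rows_for_shard(global_offset: int, local_rows: int, block_size: int) -> tuple[int, int]:
    """Half-open range of scale_inv block rows that covers weight rows
    [global_offset, global_offset + local_rows)."""
    start = global_offset // block_size
    end = -(-(global_offset + local_rows) // block_size)  # ceiling division
    return start, end


def dequantize_shard(local_weight, scale_inv, global_offset, block_size):
    """Multiply each local element by the scale of the block it belongs to."""
    start, end = scale_rows_for_shard(global_offset, len(local_weight), block_size)
    local_scales = scale_inv[start:end]  # only the scale rows this shard needs
    result = []
    for i, row in enumerate(local_weight):
        # Convert the local row index to a global one before finding its block.
        block_row = (global_offset + i) // block_size - start
        result.append([v * local_scales[block_row][j // block_size] for j, v in enumerate(row)])
    return result


# A 4x4 weight with 2x2 blocks has a 2x2 scale_inv grid.
scale_inv = [[0.5, 2.0], [4.0, 8.0]]
# The second of two ranks holds global rows 2 and 3.
local_weight = [[1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]]
dequantized = dequantize_shard(local_weight, scale_inv, global_offset=2, block_size=2)
print(dequantized)  # [[4.0, 4.0, 8.0, 8.0], [8.0, 8.0, 16.0, 16.0]]
```

When a shard boundary falls inside a block, the ceiling division in scale_rows_for_shard ensures the partially covered block row is still included.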

Sources: areal/experimental/models/archon/fp8_checkpoint.py:34-196

MoE Checkpoint Metadata

For Mixture-of-Experts (MoE) models, particularly when using ArchonEngine with Expert Parallelism (EP), the metadata must track how expert weights are sharded across the device mesh.

MoEConversionState

The MoEConversionState dataclass holds metadata for DTensor-aware MoE expert weight conversion, including grouped_expert_weight_placements and local_experts_indices (areal/experimental/models/archon/moe_weight_converter.py:18-31).

DTensor-Aware Conversion

The MoEWeightConverter uses this metadata to split 3D expert parameters into lists of 2D parameters for HuggingFace compatibility without incurring local memory overhead during save/load (areal/experimental/models/archon/moe_weight_converter.py:46-53).
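The 3D-to-2D split can be illustrated with a small sketch. It is not the MoEWeightConverter implementation: nested lists stand in for DTensors, the field types of MoEConversionState are assumptions, and the split_experts helper and its key template are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MoEConversionState:
    # Field names follow the summary table; the dict types are assumptions.
    grouped_expert_weight_placements: dict = field(default_factory=dict)
    local_experts_indices: dict = field(default_factory=dict)


def split_experts(name, local_weight_3d, state, template="experts.{idx}.weight"):
    """Yield (per-expert key, 2D weight) pairs for the experts held on this rank."""
    indices = state.local_experts_indices[name]
    if len(indices) != len(local_weight_3d):
        raise ValueError("one global expert index is required per local expert slice")
    for global_idx, weight_2d in zip(indices, local_weight_3d):
        # Each 2D weight is the existing slice, not a copy.
        yield template.format(idx=global_idx), weight_2d


# Second of two expert-parallel ranks: it holds experts 2 and 3 out of 4.
state = MoEConversionState(local_experts_indices={"moe.experts.w1": [2, 3]})
local_w1 = [
    [[1.0, 2.0], [3.0, 4.0]],  # local slot 0, global expert 2
    [[5.0, 6.0], [7.0, 8.0]],  # local slot 1, global expert 3
]
pairs = list(split_experts("moe.experts.w1", local_w1, state))
print([key for key, _ in pairs])  # ['experts.2.weight', 'experts.3.weight']
```

Recording the global expert indices per rank is what allows each rank to emit correctly numbered per-expert keys without gathering the full expert tensor.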

Sources: areal/experimental/models/archon/moe_weight_converter.py:18-61

Integration with Checkpoint System

[Diagram: Metadata Hierarchy and Code Associations]

Sources: areal/api/io_struct.py:132-296, areal/utils/recover.py:35-144, areal/utils/saver.py:23-33, areal/experimental/models/archon/moe_weight_converter.py:18-61

Summary Table

| Dataclass | Primary Fields | Computed Properties | Primary Consumer |
|---|---|---|---|
| FinetuneSpec | total_train_epochs, dataset_size, train_batch_size | total_train_steps, steps_per_epoch | Trainer initialization, LR scheduler, StatsLogger |
| ParamSpec | name, shape, dtype | size | Memory planning, distributed checkpoint |
| SaveLoadMeta | path, weight_format, with_optim, tokenizer, processor | None | Saver, RecoverHandler, TrainEngine |
| RecoverInfo | last_step_info, dataloader_info, checkpoint_info | None | RecoverHandler, StatefulDataLoader |
| MoEConversionState | grouped_expert_weight_placements, local_experts_indices | None | MoEWeightConverter, ArchonEngine |

Sources: areal/api/io_struct.py:132-296, areal/utils/recover.py:35-46, areal/experimental/models/archon/moe_weight_converter.py:18-45