Last indexed: 7 May 2026 (2e12c1)

Checkpoint Metadata

This page documents the core data structures used for checkpoint-related metadata in AReaL. These structures capture information about training schedules (FinetuneSpec), individual parameter specifications (ParamSpec), and checkpoint save/load operations (SaveLoadMeta). It also covers the RecoverInfo structure used by the recovery system to resume training and handle state persistence across distributed ranks.

For information about weight synchronization between training and inference engines, see Weight Update Metadata. For information about the actual checkpoint saving and recovery implementation, see Checkpointing and Recovery.

Overview

AReaL's checkpoint metadata system consists of several primary dataclasses that serve distinct purposes in the checkpoint lifecycle:

Structure	Purpose	Primary Use Case
`FinetuneSpec`	Training schedule specification	Calculating total training steps and epoch boundaries
`ParamSpec`	Parameter metadata	Describing individual parameter properties and memory requirements
`SaveLoadMeta`	Checkpoint I/O configuration	Coordinating model save/load operations with tokenizer and processor
`RecoverInfo`	Training state persistence	Bundling step info, dataloader states, and component metadata for recovery

Sources: areal/api/io_struct.py:132-296, areal/utils/recover.py:35-46

FinetuneSpec

Structure Definition

The FinetuneSpec dataclass encapsulates training schedule information, enabling precise calculation of training steps and epoch boundaries (areal/api/io_struct.py:132-136). Its primary fields are total_train_epochs, dataset_size, and train_batch_size.

Computed Properties

FinetuneSpec provides two computed properties that derive training schedule information:

Property	Formula	Description
`total_train_steps`	`total_train_epochs × (dataset_size // train_batch_size)`	Total number of optimizer steps across all epochs (assumes `drop_last=True`)
`steps_per_epoch`	`dataset_size // train_batch_size`	Number of optimizer steps per epoch
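The two formulas in the table can be sketched as a small dataclass. This is a minimal illustration, not the AReaL source: the class name `FinetuneSpecSketch` is hypothetical, and only the three field names and the two property formulas come from this page.

```python
from dataclasses import dataclass


@dataclass
class FinetuneSpecSketch:
    """Illustrative stand-in for FinetuneSpec (hypothetical name, not the AReaL class)."""

    total_train_epochs: int
    dataset_size: int
    train_batch_size: int

    @property
    def steps_per_epoch(self) -> int:
        # Floor division discards a trailing partial batch (drop_last=True).
        return self.dataset_size // self.train_batch_size

    @property
    def total_train_steps(self) -> int:
        return self.total_train_epochs * self.steps_per_epoch


spec = FinetuneSpecSketch(total_train_epochs=3, dataset_size=1000, train_batch_size=64)
print(spec.steps_per_epoch)    # 15, because 1000 // 64 == 15
print(spec.total_train_steps)  # 45
```

With these numbers, 40 samples per epoch (1000 - 15 × 64) are never consumed, which is the practical consequence of the `drop_last=True` assumption.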

Sources: areal/api/io_struct.py:137-145

Usage in Training Loop

FinetuneSpec is used to initialize learning rate schedulers and to compute epoch boundaries. Because both computed properties use floor division, they assume drop_last=True: a trailing partial batch is dropped, so every training step processes a full batch. StatsLogger also uses the spec to log progress relative to the total training duration (areal/utils/stats_logger.py:138-142).
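As a rough illustration of how epoch boundaries and progress can be derived from the schedule, the sketch below maps a global step to its position within an epoch. The helpers `locate_step` and `progress_fraction` are hypothetical and assume 0-based step counting; they are not AReaL functions.

```python
def locate_step(global_step: int, steps_per_epoch: int) -> tuple[int, int]:
    """Map a 0-based global step to (epoch, step within that epoch)."""
    if steps_per_epoch <= 0:
        raise ValueError("steps_per_epoch must be positive")
    return divmod(global_step, steps_per_epoch)


def progress_fraction(global_step: int, total_train_steps: int) -> float:
    """Fraction of training done once the 0-based `global_step` has finished."""
    return (global_step + 1) / total_train_steps


steps_per_epoch = 15    # e.g. 1000 // 64
total_train_steps = 45  # 3 epochs

print(locate_step(14, steps_per_epoch))  # (0, 14): last step of epoch 0
print(locate_step(15, steps_per_epoch))  # (1, 0): first step of epoch 1
print(progress_fraction(44, total_train_steps))  # 1.0 after the final step
```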

[Diagram: FinetuneSpec Data Flow]

Sources: areal/api/io_struct.py:132-145, areal/utils/recover.py:150-155, areal/utils/stats_logger.py:138-142

ParamSpec

Structure Definition

The ParamSpec dataclass describes an individual model parameter through its name, shape, and dtype (areal/api/io_struct.py:147-152).

Memory Size Calculation

ParamSpec provides a size property that computes the parameter's memory footprint in bytes. The calculation multiplies the torch dtype's itemsize by the product of the tensor shape (areal/api/io_struct.py:153-156).
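The size computation can be sketched without a torch dependency. In this illustration the class name `ParamSpecSketch` and the `ITEMSIZE` lookup table are assumptions: the table stands in for the torch dtype's itemsize, and dtypes are identified by name for simplicity.

```python
import math
from dataclasses import dataclass

# Bytes per element for a few dtypes; a stand-in for the torch dtype's itemsize.
ITEMSIZE = {"float32": 4, "bfloat16": 2, "float16": 2, "float8_e4m3fn": 1}


@dataclass
class ParamSpecSketch:
    """Illustrative stand-in for ParamSpec (hypothetical name)."""

    name: str
    shape: tuple[int, ...]
    dtype: str  # dtype identified by name in this sketch

    @property
    def size(self) -> int:
        # Footprint in bytes: itemsize times the number of elements.
        return ITEMSIZE[self.dtype] * math.prod(self.shape)


p = ParamSpecSketch(name="lm_head.weight", shape=(4096, 1024), dtype="bfloat16")
print(p.size)  # 8388608 bytes, i.e. 2 * 4096 * 1024
```

Summing `size` over a list of specs gives the total bytes for a group of parameters, which is the kind of figure memory planning needs.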

SaveLoadMeta

Structure Definition

The SaveLoadMeta dataclass coordinates all information needed for checkpoint save and load operations (areal/api/io_struct.py:288-296).

Field Descriptions

Field	Type	Description
`path`	`str`	Filesystem path where checkpoint is saved or loaded
`weight_format`	`str`	Format specification (e.g., "safetensors", "torch", "hf"); see areal/utils/saver.py:152
`with_optim`	`bool`	Whether to include optimizer states in checkpoint
`tokenizer`	`PreTrainedTokenizerFast | None`	Tokenizer instance to save/load alongside weights
`processor`	`AutoProcessor | None`	Processor for vision-language models (e.g. Qwen2-VL)
`base_model_path`	`str | None`	Path to base model (used for LoRA adapter checkpoints)
`naive_distributed`	`bool`	Whether to use naive distributed saving (all ranks save full state)
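The field table above can be read as the following dataclass sketch. The class name `SaveLoadMetaSketch` and the example path are hypothetical; the tokenizer and processor are typed as `Any` to avoid importing transformers, and no default values are assumed because this page does not state them.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SaveLoadMetaSketch:
    """Illustrative stand-in for SaveLoadMeta (hypothetical name)."""

    path: str
    weight_format: str
    with_optim: bool
    tokenizer: Optional[Any]       # PreTrainedTokenizerFast | None in the table
    processor: Optional[Any]       # AutoProcessor | None in the table
    base_model_path: Optional[str]
    naive_distributed: bool


# A save request for weights plus optimizer state, without tokenizer or processor.
meta = SaveLoadMetaSketch(
    path="/tmp/ckpt/step_100",  # hypothetical location
    weight_format="hf",
    with_optim=True,
    tokenizer=None,
    processor=None,
    base_model_path=None,
    naive_distributed=False,
)
print(meta.weight_format, meta.with_optim)
```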

Sources: areal/api/io_struct.py:288-296, areal/utils/saver.py:150-158

Checkpoint Operation Flow

[Diagram: SaveLoadMeta Checkpoint Lifecycle]

Sources: areal/api/io_struct.py:288-296, areal/utils/saver.py:122-159

Recovery Metadata

RecoverInfo Structure

The RecoverInfo dataclass persists the complete state of a training trial, including component-specific metadata and dataloader states (areal/utils/recover.py:36-45). Its primary fields include last_step_info, dataloader_info, and checkpoint_info.

Persistence and Serialization

The RecoverInfo.dump method handles serialization. In distributed environments, it calls all_gather_object to collect dataloader_info from every rank, so that the StatefulDataLoader can resume exactly where it left off (areal/utils/recover.py:52-62).

The structure is split across multiple files in the recover_info directory (areal/utils/recover.py:65-94):

step_info.json: Contains the StepInfo (epoch, steps).
dataloader_info.pkl: Pickled state of the StatefulDataLoader (rank-specific in distributed mode).
saver_info.json, evaluator_info.json, stats_logger_info.json, checkpoint_info.json: Component metadata.
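The file layout listed above can be sketched with the standard library alone. This is a single-process illustration under stated assumptions: `dump_recover_info_sketch` is a hypothetical helper, the dictionary keys in the example are invented, and the cross-rank gather of dataloader state is not modeled.

```python
import json
import pickle
import tempfile
from pathlib import Path


def dump_recover_info_sketch(root, step_info, dataloader_info, components):
    """Write the recover_info layout described above under `root`."""
    out = Path(root) / "recover_info"
    out.mkdir(parents=True, exist_ok=True)
    # Step counters are small and human-readable, so JSON is used.
    (out / "step_info.json").write_text(json.dumps(step_info))
    # Dataloader state may hold arbitrary Python objects, so it is pickled.
    with open(out / "dataloader_info.pkl", "wb") as f:
        pickle.dump(dataloader_info, f)
    # One JSON file per component: saver, evaluator, stats_logger, checkpoint.
    for name, info in components.items():
        (out / f"{name}_info.json").write_text(json.dumps(info))
    return out


with tempfile.TemporaryDirectory() as tmp:
    out = dump_recover_info_sketch(
        tmp,
        step_info={"epoch": 1, "global_step": 20},  # hypothetical keys
        dataloader_info={"samples_seen": 1280},     # hypothetical state
        components={"saver": {}, "evaluator": {}, "stats_logger": {}, "checkpoint": {}},
    )
    written = sorted(p.name for p in out.iterdir())

print(written)
```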

Sources: areal/utils/recover.py:47-94

FP8 Checkpoint Metadata

For models using FP8 quantization (e.g., via ArchonEngine), additional metadata is required to handle blockwise scaling factors.

FP8 Detection and Preparation

The system detects FP8 checkpoints by searching for *_scale_inv keys in the model.safetensors.index.json file (areal/experimental/models/archon/fp8_checkpoint.py:34-44). Before such a checkpoint is loaded via Distributed Checkpoint (DCP), the hf_state_dict must be mutated: placeholder dtypes change from BF16 to float8_e4m3fn, and float32 placeholders are inserted for the inverse scales (areal/experimental/models/archon/fp8_checkpoint.py:47-116).
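The detection step can be approximated as follows. This sketch assumes the standard Hugging Face index layout, in which tensor names appear as keys of a top-level `weight_map` object; the function name `is_fp8_checkpoint_sketch` and the tensor names in the example are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def is_fp8_checkpoint_sketch(checkpoint_dir) -> bool:
    """Return True if the safetensors index lists any *_scale_inv tensor."""
    index_path = Path(checkpoint_dir) / "model.safetensors.index.json"
    if not index_path.exists():
        return False
    weight_map = json.loads(index_path.read_text()).get("weight_map", {})
    return any(key.endswith("_scale_inv") for key in weight_map)


with tempfile.TemporaryDirectory() as tmp:
    index = {
        "weight_map": {
            "layers.0.proj.weight": "model-00001.safetensors",
            "layers.0.proj.weight_scale_inv": "model-00001.safetensors",
        }
    }
    (Path(tmp) / "model.safetensors.index.json").write_text(json.dumps(index))
    detected = is_fp8_checkpoint_sketch(tmp)

with tempfile.TemporaryDirectory() as empty:
    missing = is_fp8_checkpoint_sketch(empty)

print(detected, missing)  # True False
```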

Sharded Dequantization

When loading FP8 weights into an FSDP-sharded model, the system performs local dequantization. For Shard(0) placements, it calculates the global_offset and slices the scale_inv tensor to match the local shard's row range (areal/experimental/models/archon/fp8_checkpoint.py:152-196).
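The row-range slicing can be illustrated in pure Python with nested lists in place of tensors. Several things here are assumptions rather than facts from this page: the helper names, the square block size passed as `block`, and the convention that dequantization multiplies each weight by the `scale_inv` entry of its block.

```python
def local_scale_rows(global_offset: int, local_rows: int, block: int) -> range:
    """Block-row indices of scale_inv covering rows [global_offset, global_offset + local_rows)."""
    if local_rows == 0:
        return range(0)
    first = global_offset // block
    last = (global_offset + local_rows - 1) // block
    return range(first, last + 1)


def dequantize_shard(shard, scale_inv, global_offset, block):
    """Scale each local element by the scale_inv entry of the block it falls in."""
    return [
        [
            value * scale_inv[(global_offset + i) // block][j // block]
            for j, value in enumerate(row)
        ]
        for i, row in enumerate(shard)
    ]


# A 4x4 weight with 2x2 blocks has a 2x2 scale_inv grid.
scale_inv = [[0.5, 2.0], [4.0, 1.0]]
# The second of two ranks holds global rows 2 and 3, so its offset is 2.
shard = [[1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]]

print(list(local_scale_rows(2, 2, 2)))  # [1]
print(dequantize_shard(shard, scale_inv, global_offset=2, block=2))
# [[4.0, 4.0, 1.0, 1.0], [8.0, 8.0, 2.0, 2.0]]
```

Indexing by the global row (offset plus local index) is what lets a rank apply the right scales even when its shard boundary does not align with a block boundary.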

Sources: areal/experimental/models/archon/fp8_checkpoint.py:34-196

MoE Checkpoint Metadata

For Mixture-of-Experts (MoE) models, particularly when using ArchonEngine with Expert Parallelism (EP), the metadata must track how expert weights are sharded across the device mesh.

MoEConversionState

The MoEConversionState dataclass holds metadata for DTensor-aware MoE expert weight conversion (areal/experimental/models/archon/moe_weight_converter.py:18-31):

grouped_expert_weight_placements: Tracks the DTensor placements for grouped expert weights (areal/experimental/models/archon/moe_weight_converter.py:33).
grouped_expert_weight_shape: Stores the original 3D shape of expert parameters (areal/experimental/models/archon/moe_weight_converter.py:34).
local_experts_indices: Maps local GPU shards to the global expert indices (areal/experimental/models/archon/moe_weight_converter.py:37).

DTensor-Aware Conversion

The MoEWeightConverter uses this metadata to split 3D expert parameters into lists of 2D parameters for HuggingFace compatibility, without incurring additional local memory overhead during save/load (areal/experimental/models/archon/moe_weight_converter.py:46-53).
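The idea of splitting a grouped 3D expert weight into per-expert 2D entries can be sketched with nested lists. This illustration rests on assumptions not stated on this page: experts are divided evenly and contiguously across expert-parallel ranks, the per-expert key format is invented, and both helper names are hypothetical.

```python
def local_expert_indices(num_experts: int, ep_size: int, ep_rank: int) -> list[int]:
    """Global expert ids held by one rank under an even, contiguous split (assumed)."""
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    per_rank = num_experts // ep_size
    start = ep_rank * per_rank
    return list(range(start, start + per_rank))


def split_grouped_experts(prefix, local_weight, expert_ids):
    """Expose each local expert's 2D matrix under its own key, without copying."""
    return {
        f"{prefix}.{eid}.weight": matrix  # hypothetical key format
        for eid, matrix in zip(expert_ids, local_weight)
    }


# Rank 1 of 4 holds experts 2 and 3 out of 8; each expert is a 2x2 matrix here.
ids = local_expert_indices(num_experts=8, ep_size=4, ep_rank=1)
local_weight = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
per_expert = split_grouped_experts("mlp.experts", local_weight, ids)

print(ids)               # [2, 3]
print(list(per_expert))  # ['mlp.experts.2.weight', 'mlp.experts.3.weight']
```

Each dictionary value is the same object as the corresponding slice of `local_weight`, which loosely mirrors the goal of avoiding extra local copies during conversion.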

Sources: areal/experimental/models/archon/moe_weight_converter.py:18-61

Integration with Checkpoint System

[Diagram: Metadata Hierarchy and Code Associations]

Sources: areal/api/io_struct.py:132-296, areal/utils/recover.py:35-144, areal/utils/saver.py:23-33, areal/experimental/models/archon/moe_weight_converter.py:18-61

Summary Table

Dataclass	Primary Fields	Computed Properties	Primary Consumer
`FinetuneSpec`	`total_train_epochs`, `dataset_size`, `train_batch_size`	`total_train_steps`, `steps_per_epoch`	Trainer initialization, LR scheduler, StatsLogger
`ParamSpec`	`name`, `shape`, `dtype`	`size`	Memory planning, distributed checkpoint
`SaveLoadMeta`	`path`, `weight_format`, `with_optim`, `tokenizer`, `processor`	None	`Saver`, `RecoverHandler`, `TrainEngine`
`RecoverInfo`	`last_step_info`, `dataloader_info`, `checkpoint_info`	None	`RecoverHandler`, `StatefulDataLoader`
`MoEConversionState`	`grouped_expert_weight_placements`, `local_experts_indices`	None	`MoEWeightConverter`, `ArchonEngine`

Sources: areal/api/io_struct.py:132-296, areal/utils/recover.py:35-46, areal/experimental/models/archon/moe_weight_converter.py:18-45


URL: https://deepwiki.com/inclusionAI/AReaL/11.3-checkpoint-metadata