Last indexed: 7 May 2026 (2e12c1)

Checkpointing and Recovery

This document describes AReaL's checkpointing and recovery system, which enables fault-tolerant training by periodically saving model weights, optimizer states, and training metadata. The system supports resuming training from any saved checkpoint with exact reproducibility.

For configuration of when and where to save checkpoints, see Configuration Overview. Loading model weights at startup is handled during engine initialization (see TrainEngine API).

Overview

AReaL's checkpoint system consists of three main components:

Model and Optimizer Checkpoints: Saved in either HuggingFace (HF) or Distributed Checkpoint (DCP) format.
Recovery Metadata: Training state including step counters, dataloader state, and component states.
Backend-Specific Implementations: Different checkpoint formats and strategies per training backend (FSDP, Megatron, Archon).

Checkpoint Components and Data Flow

Sources: areal/utils/recover.py41-50 areal/utils/recover.py52-94 areal/utils/recover.py168-176

Checkpoint Formats

AReaL supports two checkpoint formats, each with different tradeoffs:

HuggingFace (HF) Format

The HuggingFace format stores model weights in a format compatible with the transformers library, enabling easy model sharing and deployment.

Characteristics:

Portability: Can be loaded directly by HuggingFace transformers.
Single rank save: Typically, only rank 0 performs the final save operation after gathering.
Format: .safetensors or .bin files with config.json.
Use case: Final model export, sharing checkpoints.

Implementation:

Archon: areal/experimental/engine/archon_checkpoint.py187-196 - Writes model.safetensors.index.json for multi-file HF checkpoints.
FSDP: areal/engine/fsdp_utils/__init__.py110-148 - fsdp2_load_full_state_dict handles loading full state dicts into sharded FSDP2 models.
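
As a rough illustration of the index file mentioned above, the sketch below builds the usual HuggingFace sharded-checkpoint index layout: a `metadata.total_size` entry plus a `weight_map` from tensor name to shard file. The helper `build_hf_index` and the tensor names and sizes are hypothetical; this is not AReaL's writer.

```python
import json


def build_hf_index(shards: dict) -> dict:
    """Build an index dict from {shard filename: {tensor name: size in bytes}}."""
    weight_map = {}
    total_size = 0
    for filename, tensors in shards.items():
        for name, nbytes in tensors.items():
            weight_map[name] = filename  # which shard file holds this tensor
            total_size += nbytes
    return {"metadata": {"total_size": total_size}, "weight_map": weight_map}


# Hypothetical two-shard checkpoint with made-up tensor sizes.
index = build_hf_index(
    {
        "model-00001-of-00002.safetensors": {"model.embed_tokens.weight": 1024},
        "model-00002-of-00002.safetensors": {"lm_head.weight": 2048},
    }
)
index_json = json.dumps(index, indent=2)  # content of model.safetensors.index.json
```

A loader can read this file first and open only the shard that contains the tensor it needs.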

Distributed Checkpoint (DCP) Format

The DCP format is PyTorch's native distributed checkpoint format, optimized for large-scale distributed training.

Characteristics:

Parallel I/O: Each rank saves its sharded state independently.
Optimizer support: Includes optimizer state by default.
Scalability: Efficient for very large models with many GPUs.
Format: Multiple files in a directory structure.
Use case: Training checkpoints, intermediate saves, large models.

Implementation:

Archon: areal/experimental/engine/archon_checkpoint.py86-145 - Custom DCPState for Pipeline Parallel (PP) support.
Stateful Wrapper: Uses torch.distributed.checkpoint.stateful.Stateful to manage state dicts areal/experimental/engine/archon_checkpoint.py86

Sources: areal/experimental/engine/archon_checkpoint.py86-145 areal/engine/fsdp_utils/__init__.py110-148
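
The Stateful contract boils down to two methods, `state_dict()` and `load_state_dict()`. The sketch below mirrors that contract with a local `Protocol` so it runs without PyTorch; `StatefulLike` and `ToyDCPState` are hypothetical stand-ins, not the real `DCPState`.

```python
from typing import Any, Dict, Protocol, runtime_checkable


@runtime_checkable
class StatefulLike(Protocol):
    """Local stand-in for the two-method Stateful contract."""

    def state_dict(self) -> Dict[str, Any]: ...

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None: ...


class ToyDCPState:
    """Bundles model and optimizer state behind one save/load interface."""

    def __init__(self, model_state: Dict[str, Any], optim_state: Dict[str, Any]):
        self.model_state = model_state
        self.optim_state = optim_state

    def state_dict(self) -> Dict[str, Any]:
        # Called at save time: return everything that should be persisted.
        return {"model": dict(self.model_state), "optim": dict(self.optim_state)}

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
        # Called at load time: restore the wrapped objects in place.
        self.model_state.update(state_dict["model"])
        self.optim_state.update(state_dict["optim"])


src = ToyDCPState({"w": 1.0}, {"w.exp_avg": 0.1})
dst = ToyDCPState({"w": 0.0}, {"w.exp_avg": 0.0})
dst.load_state_dict(src.state_dict())
```

Wrapping both pieces in one object lets the checkpoint library treat the model and optimizer as a single unit when saving and loading.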

Recovery Metadata System

The RecoverHandler manages training state beyond model weights, enabling exact training resumption.

RecoverInfo Structure

The RecoverInfo dataclass stores all metadata required for recovery.

Components:

Component	Type	Purpose	File
`last_step_info`	`StepInfo`	Training counters (epoch, step)	`step_info.json`
`saver_info`	`dict`	`Saver` component state	`saver_info.json`
`evaluator_info`	`dict`	`Evaluator` component state	`evaluator_info.json`
`stats_logger_info`	`dict`	`StatsLogger` state	`stats_logger_info.json`
`dataloader_info`	`dict | list`	`StatefulDataLoader` state (per rank)	`dataloader_info.pkl`
`checkpoint_info`	`dict`	Checkpoint-specific metadata	`checkpoint_info.json`

Sources: areal/utils/recover.py41-50 areal/utils/recover.py151-166
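
The file layout in the table can be sketched as a small dataclass that writes one file per component. This is a simplified illustration only: `ToyRecoverInfo` is hypothetical, a plain `dict` stands in for `StepInfo`, and the real `RecoverInfo` may serialize its fields differently.

```python
import json
import pickle
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Union

# Field name -> JSON file name, following the table above.
JSON_FILES = {
    "last_step_info": "step_info.json",
    "saver_info": "saver_info.json",
    "evaluator_info": "evaluator_info.json",
    "stats_logger_info": "stats_logger_info.json",
    "checkpoint_info": "checkpoint_info.json",
}


@dataclass
class ToyRecoverInfo:
    last_step_info: dict = field(default_factory=dict)  # stand-in for StepInfo
    saver_info: dict = field(default_factory=dict)
    evaluator_info: dict = field(default_factory=dict)
    stats_logger_info: dict = field(default_factory=dict)
    dataloader_info: Union[dict, list] = field(default_factory=dict)
    checkpoint_info: dict = field(default_factory=dict)

    def dump(self, root: Path) -> None:
        root.mkdir(parents=True, exist_ok=True)
        for name, filename in JSON_FILES.items():
            (root / filename).write_text(json.dumps(getattr(self, name)))
        # The dataloader state is pickled rather than written as JSON.
        (root / "dataloader_info.pkl").write_bytes(pickle.dumps(self.dataloader_info))

    @classmethod
    def load(cls, root: Path) -> "ToyRecoverInfo":
        kwargs = {n: json.loads((root / f).read_text()) for n, f in JSON_FILES.items()}
        kwargs["dataloader_info"] = pickle.loads(
            (root / "dataloader_info.pkl").read_bytes()
        )
        return cls(**kwargs)


with tempfile.TemporaryDirectory() as tmp:
    info = ToyRecoverInfo(
        last_step_info={"epoch": 1, "global_step": 120},
        dataloader_info=[{"offset": 30}, {"offset": 31}],  # one entry per rank
    )
    info.dump(Path(tmp))
    restored = ToyRecoverInfo.load(Path(tmp))
```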

Backend-Specific Implementations

Archon Engine Checkpointing

The ArchonEngine uses a custom DCPState wrapper that handles both single-model and pipeline parallel configurations.

Key Design Decisions (from DCPState):

Flattened optimizer state: Uses flatten_optimizer_state_dict=True areal/experimental/engine/archon_checkpoint.py130-133 to avoid parameter group index collisions across PP stages (keys become parameter FQNs).
Non-strict loading in PP mode: PP mode loads with strict=False because each stage holds only a subset of the keys areal/experimental/engine/archon_checkpoint.py151-153
Consolidation: Uses _consolidate_shards_distributed to distribute safetensors consolidation across ranks with correct process group barriers to avoid deadlocks areal/experimental/engine/archon_checkpoint.py36-84

Sources: areal/experimental/engine/archon_checkpoint.py36-166
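
The motivation for flattening can be seen with plain dictionaries. In the sketch below, two pipeline stages each number their parameters from zero, so index-keyed optimizer state collides when merged, while FQN-keyed state does not. The parameter names and the key format are illustrative assumptions, not the engine's actual state.

```python
# Each PP stage owns a disjoint slice of the model's parameters.
stage0_params = ["layers.0.weight", "layers.1.weight"]
stage1_params = ["layers.2.weight", "layers.3.weight"]


def indexed_state(param_names):
    """Optimizer state keyed by per-stage integer index (unflattened style)."""
    return {i: {"exp_avg": f"m({name})"} for i, name in enumerate(param_names)}


def flattened_state(param_names):
    """Optimizer state keyed by parameter FQN (flattened style)."""
    return {f"state.{name}.exp_avg": f"m({name})" for name in param_names}


# Both stages use indices 0 and 1, so their keys overlap.
collisions = indexed_state(stage0_params).keys() & indexed_state(stage1_params).keys()

# FQN keys are globally unique, so the two stages merge without overwriting.
merged = {**flattened_state(stage0_params), **flattened_state(stage1_params)}
```

The same picture explains the non-strict load: each stage looks up only its own FQN keys and ignores the rest of the checkpoint.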

FSDP Engine Checkpointing

FSDP2 integration uses fully_shard and specialized state dict utilities.

Key features:

FSDP2 Loading: fsdp2_load_full_state_dict handles broadcasting from rank 0 to all other ranks and materializing meta-tensors areal/engine/fsdp_utils/__init__.py110-148
Optimizer State: PerLayerOptimWrapper manages optimizer states per layer to reduce peak memory during updates areal/engine/fsdp_utils/optimizer.py228-261
AnyPrecision Support: AnyPrecisionAdamW allows saving/loading momentum and variance in bfloat16 to save memory areal/engine/fsdp_utils/optimizer.py44-101

Sources: areal/engine/fsdp_utils/__init__.py110-148 areal/engine/fsdp_utils/optimizer.py44-261
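
To put the bfloat16 saving in numbers: an Adam-style optimizer keeps two state tensors per parameter (momentum and variance), so halving the element size halves the state. The back-of-the-envelope sketch below assumes a hypothetical 7-billion-parameter model and ignores sharding and any other optimizer bookkeeping.

```python
def adam_state_bytes(num_params: int, bytes_per_element: int) -> int:
    """Bytes needed for momentum plus variance, one element each per parameter."""
    return 2 * num_params * bytes_per_element


num_params = 7_000_000_000  # hypothetical model size

fp32_gib = adam_state_bytes(num_params, 4) / 2**30  # float32: 4 bytes per element
bf16_gib = adam_state_bytes(num_params, 2) / 2**30  # bfloat16: 2 bytes per element
```

For this model size the state drops from roughly 52 GiB to roughly 26 GiB in total, before it is divided across ranks.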

Dataloader State Management

AReaL tracks dataloader state so that, after recovery, training resumes from the exact data sample where it left off.

Distributed Dataloader Recovery

In distributed training, each rank has different dataloader state. The RecoverHandler coordinates state saving and loading across ranks:

Implementation Details:

Save distributed state: areal/utils/recover.py57-62 - all_gather_object collects states from all ranks into a list.
Single-rank write: areal/utils/recover.py64-66 - Only rank 0 writes to avoid filesystem contention.
Load and select: areal/utils/recover.py128-136 - Each rank selects its specific state from the loaded list based on dist.get_rank().

Sources: areal/utils/recover.py57-66 areal/utils/recover.py128-136
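
The gather, single-writer, and per-rank selection steps can be simulated in one process. In this sketch `gather_states` stands in for `all_gather_object`, and the rank is passed in explicitly instead of coming from `dist.get_rank()`; all function names are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def gather_states(per_rank_states):
    """Stand-in for all_gather_object: a list of states ordered by rank."""
    return list(per_rank_states)


def save_dataloader_states(path: Path, gathered, rank: int) -> None:
    # Only rank 0 writes, so the ranks do not contend for the same file.
    if rank == 0:
        path.write_bytes(pickle.dumps(gathered))


def load_dataloader_state(path: Path, rank: int):
    states = pickle.loads(path.read_bytes())
    # A list holds one state per rank; a dict is a single non-distributed state.
    return states[rank] if isinstance(states, list) else states


world_size = 4
local_states = [{"rank": r, "samples_seen": 100 + r} for r in range(world_size)]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "dataloader_info.pkl"
    gathered = gather_states(local_states)
    for rank in range(world_size):
        save_dataloader_states(path, gathered, rank)
    recovered = [load_dataloader_state(path, r) for r in range(world_size)]
```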

Recovery Handler Interface

The RecoverHandler class provides the main interface for checkpoint recovery.

Key Methods

Method	Line Reference	Purpose
`dump(...)`	areal/utils/recover.py214-254	Save current training state to recovery checkpoint.
`load(...)`	areal/utils/recover.py256-310	Attempt to recover from the latest checkpoint.
`recover_info_path(...)`	areal/utils/recover.py168-176	Determine recovery path using experiment and trial names.

Recovery Logic:

Check if recovery mode is enabled areal/utils/recover.py216-217
Validate frequency control (epochs, steps, or seconds) areal/utils/recover.py161-165
Serialize all metadata components via RecoverInfo.dump() areal/utils/recover.py52-94

Sources: areal/utils/recover.py151-310
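
Frequency control by epochs, steps, or seconds can be sketched as a small counter that fires when any configured threshold is reached. `ToyFreqControl` and its parameter names are hypothetical and only illustrate the idea; AReaL's own frequency control may behave differently.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyFreqControl:
    freq_epochs: Optional[int] = None
    freq_steps: Optional[int] = None
    freq_secs: Optional[float] = None
    start_time: Optional[float] = None  # injectable clock origin for testing
    _epochs: int = 0
    _steps: int = 0

    def __post_init__(self):
        self._last = time.monotonic() if self.start_time is None else self.start_time

    def check(self, epochs: int = 0, steps: int = 0, now: Optional[float] = None) -> bool:
        """Accumulate progress and report whether a dump is due."""
        self._epochs += epochs
        self._steps += steps
        now = time.monotonic() if now is None else now
        due = (
            (self.freq_epochs is not None and self._epochs >= self.freq_epochs)
            or (self.freq_steps is not None and self._steps >= self.freq_steps)
            or (self.freq_secs is not None and now - self._last >= self.freq_secs)
        )
        if due:
            # Reset all counters so the next interval starts from this dump.
            self._epochs, self._steps, self._last = 0, 0, now
        return due


by_steps = ToyFreqControl(freq_steps=3, start_time=0.0)
step_results = [by_steps.check(steps=1, now=0.0) for _ in range(6)]

by_time = ToyFreqControl(freq_secs=10.0, start_time=0.0)
time_results = [by_time.check(now=t) for t in (5.0, 12.0, 15.0, 22.0)]
```

A dump routine would call `check(...)` once per training step and serialize the metadata only when it returns `True`.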

Async Checkpointing

AReaL supports asynchronous checkpointing for the ArchonEngine to minimize training stalls.

Async Mode Management

The Saver class manages AsyncCheckpointManager instances to handle background saving.

Workflow:

Trigger: Saver.save() checks if the engine supports async mode areal/utils/saver.py122-148
Staging: Weights are staged to a buffer. maybe_wait_for_staging() is called before the next update to ensure the buffer is free areal/utils/saver.py182-186
Background Save: save_model_to_hf is invoked with an async_mgr to perform I/O in a separate thread areal/utils/saver.py173-180

Sources: areal/utils/saver.py98-186
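
The staging and background-save pattern can be sketched with a thread and an event. Here the worker takes a snapshot of the weights (staging), signals that the snapshot is complete, and then performs the slow write, while the trainer waits only for staging before mutating the weights. `ToyAsyncSaver` is a hypothetical illustration that borrows the method name `maybe_wait_for_staging`; it is not the `AsyncCheckpointManager` implementation.

```python
import copy
import threading
import time


class ToyAsyncSaver:
    def __init__(self):
        self._staged = threading.Event()
        self._staged.set()  # nothing in flight, so the buffer is free
        self._thread = None
        self.saved = []  # stands in for files written to disk

    def save_async(self, weights: dict) -> None:
        self.wait()  # allow only one in-flight save at a time
        self._staged.clear()
        self._thread = threading.Thread(target=self._worker, args=(weights,))
        self._thread.start()

    def _worker(self, weights: dict) -> None:
        snapshot = copy.deepcopy(weights)  # staging: copy out of live weights
        self._staged.set()  # the trainer may now modify the weights
        time.sleep(0.01)  # simulated slow I/O
        self.saved.append(snapshot)

    def maybe_wait_for_staging(self) -> None:
        self._staged.wait()

    def wait(self) -> None:
        if self._thread is not None:
            self._thread.join()


weights = {"w": 1.0}
saver = ToyAsyncSaver()
saver.save_async(weights)
saver.maybe_wait_for_staging()  # call before the next optimizer update
weights["w"] = 2.0  # safe: the snapshot was already taken
saver.wait()
```

The training loop stalls only for the copy, not for the write, which is where the time saving comes from.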

FP8 Checkpoint Support

AReaL provides specialized support for loading FP8-quantized checkpoints, particularly for ArchonEngine.

FP8 Detection and Preparation

The system includes heuristics to detect blockwise FP8 checkpoints (e.g., DeepSeek-V3 or Qwen3-FP8) by searching for *_scale_inv keys in the index file areal/experimental/models/archon/fp8_checkpoint.py34-44. Before loading via DCP, _prepare_fp8_state_dict mutates placeholders to float8_e4m3fn and inserts float32 placeholders for scales areal/experimental/models/archon/fp8_checkpoint.py47-116.
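
A minimal sketch of the detection heuristic, assuming the index file follows the usual HuggingFace layout with a `weight_map` of tensor names. The function name `looks_like_blockwise_fp8` and the sample tensor names are hypothetical.

```python
import json


def looks_like_blockwise_fp8(index: dict) -> bool:
    """Return True if any tensor name in the index ends with '_scale_inv'."""
    return any(name.endswith("_scale_inv") for name in index.get("weight_map", {}))


# Hypothetical index contents, as they would appear in model.safetensors.index.json.
fp8_index = json.loads(
    """
    {
      "weight_map": {
        "model.layers.0.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
        "model.layers.0.mlp.up_proj.weight_scale_inv": "model-00001-of-00002.safetensors"
      }
    }
    """
)
bf16_index = {"weight_map": {"model.layers.0.mlp.up_proj.weight": "model.safetensors"}}

is_fp8 = looks_like_blockwise_fp8(fp8_index)
is_plain = looks_like_blockwise_fp8(bf16_index)
```

Checking the index avoids opening any weight shard just to decide which loading path to take.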

Dequantization Strategies

CPU Fallback: weight_dequant_cpu performs pure PyTorch blockwise dequantization when tensors are offloaded to CPU areal/experimental/models/archon/fp8_checkpoint.py119-150
DTensor Support: _dequant_dtensor handles FSDP-sharded FP8 weights by calculating global offsets and slicing the scale-inverse tensor accordingly areal/experimental/models/archon/fp8_checkpoint.py152-197

Sources: areal/experimental/models/archon/fp8_checkpoint.py34-197
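
Blockwise dequantization multiplies every weight element by the scale-inverse entry of the block it falls in. The pure-Python sketch below uses nested lists and a tiny block size for readability, and it assumes the shard is a contiguous slice of rows, so a row offset is enough to find the right scale block. The function `dequant_blockwise` is hypothetical and is not the `weight_dequant_cpu` or `_dequant_dtensor` code.

```python
def dequant_blockwise(weight, scale_inv, block=2, row_offset=0):
    """Dequantize a (possibly row-sharded) matrix with per-block scales.

    weight:     rows held locally, as a list of lists
    scale_inv:  full grid of per-block scale-inverse values
    row_offset: global index of the first local row
    """
    out = []
    for i, row in enumerate(weight):
        block_row = (row_offset + i) // block  # use the global row index
        out.append([v * scale_inv[block_row][j // block] for j, v in enumerate(row)])
    return out


# A 4x4 matrix of ones with 2x2 blocks, so each output shows its block's scale.
full_weight = [[1.0] * 4 for _ in range(4)]
scale_inv = [[1.0, 2.0], [3.0, 4.0]]

full = dequant_blockwise(full_weight, scale_inv)

# A rank holding only global rows 2..3 must offset into the scale grid.
shard = dequant_blockwise(full_weight[2:], scale_inv, row_offset=2)
```

Without the offset, the shard would wrongly reuse the scales of the first block row, which is the error the global-offset calculation guards against.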


URL: https://deepwiki.com/inclusionAI/AReaL/3.7-checkpointing-and-recovery