
URL: https://deepwiki.com/inclusionAI/AReaL/3.7-checkpointing-and-recovery



Last indexed: 7 May 2026 (2e12c1)

Checkpointing and Recovery

This document describes AReaL's checkpointing and recovery system, which enables fault-tolerant training by periodically saving model weights, optimizer states, and training metadata. The system supports resuming training from any saved checkpoint with exact reproducibility.

For configuration of when and where to save checkpoints, see Configuration Overview. Model loading at startup is handled during engine initialization (see TrainEngine API).


Overview

AReaL's checkpoint system consists of three main components:

  1. Model and Optimizer Checkpoints: Saved in either HuggingFace (HF) or Distributed Checkpoint (DCP) format.
  2. Recovery Metadata: Training state including step counters, dataloader state, and component states.
  3. Backend-Specific Implementations: Different checkpoint formats and strategies per training backend (FSDP, Megatron, Archon).

Checkpoint Components and Data Flow


Sources: areal/utils/recover.py:41-50, areal/utils/recover.py:52-94, areal/utils/recover.py:168-176


Checkpoint Formats

AReaL supports two checkpoint formats, each with different tradeoffs:

HuggingFace (HF) Format

The HuggingFace format stores model weights in a format compatible with the transformers library, enabling easy model sharing and deployment.

Characteristics:

  • Portability: Can be loaded directly by HuggingFace transformers.
  • Single rank save: Typically, only rank 0 performs the final save operation after gathering.
  • Format: .safetensors or .bin files with config.json.
  • Use case: Final model export, sharing checkpoints.

Implementation:

Distributed Checkpoint (DCP) Format

The DCP format is PyTorch's native distributed checkpoint format, optimized for large-scale distributed training.

Characteristics:

  • Parallel I/O: Each rank saves its sharded state independently.
  • Optimizer support: Includes optimizer state by default.
  • Scalability: Efficient for very large models with many GPUs.
  • Format: Multiple files in a directory structure.
  • Use case: Training checkpoints, intermediate saves, large models.
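The difference in I/O pattern between the two formats can be illustrated with a toy simulation. This is a minimal sketch using only the standard library, not AReaL's code: the function names, the JSON files, and the file names `shard_{rank}.json` and `model.json` are all hypothetical stand-ins for real DCP shard files and `.safetensors` output.

```python
import json
import tempfile
from pathlib import Path


def save_sharded(rank, shard, ckpt_dir):
    """DCP-like pattern: every rank writes only the shard it owns."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    path = ckpt_dir / f"shard_{rank}.json"  # hypothetical file name
    path.write_text(json.dumps(shard))
    return path


def save_gathered(rank, all_shards, ckpt_dir):
    """HF-like pattern: shards are gathered and only rank 0 writes one file."""
    if rank != 0:
        return None  # other ranks take part in the gather but write nothing
    full_state = {}
    for shard in all_shards:
        full_state.update(shard)
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    path = ckpt_dir / "model.json"  # hypothetical file name
    path.write_text(json.dumps(full_state))
    return path


# Two simulated ranks, each owning one parameter.
shards = [{"layer0.weight": [1, 2]}, {"layer1.weight": [3, 4]}]
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    sharded_files = [save_sharded(r, s, root / "dcp").name for r, s in enumerate(shards)]
    gathered_files = [save_gathered(r, shards, root / "hf") for r in range(len(shards))]
    merged = json.loads((root / "hf" / "model.json").read_text())
```

In the sharded case the number of files grows with the number of ranks and no rank ever holds the full model; in the gathered case a single portable file is produced at the cost of collecting everything on one rank.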

Implementation:

Sources: areal/experimental/engine/archon_checkpoint.py:86-145, areal/engine/fsdp_utils/__init__.py:110-148


Recovery Metadata System

The RecoverHandler manages training state beyond model weights, enabling exact training resumption.

RecoverInfo Structure

The RecoverInfo dataclass stores all metadata required for recovery.


Components:

Component          Type       Purpose                              File
-----------------  ---------  -----------------------------------  ----------------------
last_step_info     StepInfo   Training counters (epoch, step)      step_info.json
saver_info         dict       Saver component state                saver_info.json
evaluator_info     dict       Evaluator component state            evaluator_info.json
stats_logger_info  dict       StatsLogger state                    stats_logger_info.json
dataloader_info    dict|list  StatefulDataLoader state (per rank)  dataloader_info.pkl
checkpoint_info    dict       Checkpoint-specific metadata         checkpoint_info.json
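The one-file-per-component layout listed above can be sketched as a small dataclass with dump and load methods. This is a simplified illustration, not the code in areal/utils/recover.py: the class name `RecoverInfoSketch`, the fields of `StepInfo`, and the method signatures are assumptions; only the component and file names come from the table.

```python
from __future__ import annotations

import json
import pickle
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path

# JSON-serializable dict components and the file each one is written to.
JSON_FILES = {
    "saver_info": "saver_info.json",
    "evaluator_info": "evaluator_info.json",
    "stats_logger_info": "stats_logger_info.json",
    "checkpoint_info": "checkpoint_info.json",
}


@dataclass
class StepInfo:
    # Hypothetical counters; the real StepInfo may carry more fields.
    epoch: int = 0
    global_step: int = 0


@dataclass
class RecoverInfoSketch:
    last_step_info: StepInfo = field(default_factory=StepInfo)
    saver_info: dict = field(default_factory=dict)
    evaluator_info: dict = field(default_factory=dict)
    stats_logger_info: dict = field(default_factory=dict)
    dataloader_info: dict | list = field(default_factory=dict)
    checkpoint_info: dict = field(default_factory=dict)

    def dump(self, root: Path) -> None:
        root.mkdir(parents=True, exist_ok=True)
        (root / "step_info.json").write_text(json.dumps(asdict(self.last_step_info)))
        for name, filename in JSON_FILES.items():
            (root / filename).write_text(json.dumps(getattr(self, name)))
        # The dataloader state is pickled rather than written as JSON.
        with open(root / "dataloader_info.pkl", "wb") as f:
            pickle.dump(self.dataloader_info, f)

    @classmethod
    def load(cls, root: Path) -> RecoverInfoSketch:
        step = StepInfo(**json.loads((root / "step_info.json").read_text()))
        parts = {n: json.loads((root / fn).read_text()) for n, fn in JSON_FILES.items()}
        with open(root / "dataloader_info.pkl", "rb") as f:
            dataloader_info = pickle.load(f)
        return cls(last_step_info=step, dataloader_info=dataloader_info, **parts)


# Round trip through a temporary directory.
original = RecoverInfoSketch(
    last_step_info=StepInfo(epoch=2, global_step=150),
    saver_info={"last_saved_step": 100},
    dataloader_info=[{"samples_seen": 1200}, {"samples_seen": 1200}],
)
with tempfile.TemporaryDirectory() as tmp:
    original.dump(Path(tmp))
    written_files = sorted(p.name for p in Path(tmp).iterdir())
    restored = RecoverInfoSketch.load(Path(tmp))
```

Keeping each component in its own file means one component can be inspected or replaced without touching the others.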

Sources: areal/utils/recover.py:41-50, areal/utils/recover.py:151-166


Backend-Specific Implementations

Archon Engine Checkpointing

The ArchonEngine uses a custom DCPState wrapper that handles both single-model and pipeline parallel configurations.

Key Design Decisions (from DCPState):

Sources: areal/experimental/engine/archon_checkpoint.py:36-166

FSDP Engine Checkpointing

FSDP2 integration uses fully_shard and specialized state dict utilities.

Key features:

Sources: areal/engine/fsdp_utils/__init__.py:110-148, areal/engine/fsdp_utils/optimizer.py:44-261


Dataloader State Management

AReaL tracks dataloader state so that, after recovery, training resumes from the exact data sample where it left off.

Distributed Dataloader Recovery

In distributed training, each rank has different dataloader state. The RecoverHandler coordinates state saving and loading across ranks:
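One way to picture the per-rank handling is a list indexed by rank: every rank contributes its state on save, and each rank picks its own entry on load. The sketch below simulates this without a process group. The function names are hypothetical, the plain list stands in for a real collective gather, and rejecting a world-size mismatch is an assumption rather than documented AReaL behavior.

```python
from __future__ import annotations


def gather_dataloader_states(per_rank_states: list[dict]) -> list[dict]:
    """Stand-in for a collective gather: entry i holds rank i's state."""
    return [dict(state) for state in per_rank_states]


def select_state_for_rank(dataloader_info: dict | list, rank: int, world_size: int) -> dict:
    """Pick the state this rank should restore from the saved dataloader_info."""
    if isinstance(dataloader_info, dict):
        # A single dict is treated as a state shared by all ranks.
        return dataloader_info
    if len(dataloader_info) != world_size:
        raise ValueError(
            f"saved state covers {len(dataloader_info)} ranks, "
            f"but the current world size is {world_size}"
        )
    return dataloader_info[rank]


# Four simulated ranks, each at a different position in its data shard.
saved = gather_dataloader_states([{"samples_seen": 8 * r} for r in range(4)])
state_for_rank_2 = select_state_for_rank(saved, rank=2, world_size=4)
```

Accepting both a dict and a list mirrors the `dict|list` type given for `dataloader_info` in the RecoverInfo table.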


Implementation Details:

Sources: areal/utils/recover.py:57-66, areal/utils/recover.py:128-136


Recovery Handler Interface

The RecoverHandler class provides the main interface for checkpoint recovery.

Key Methods

Method                  Line Reference                  Purpose
----------------------  ------------------------------  ---------------------------------------------------------
dump(...)               areal/utils/recover.py:214-254  Save current training state to recovery checkpoint.
load(...)               areal/utils/recover.py:256-310  Attempt to recover from the latest checkpoint.
recover_info_path(...)  areal/utils/recover.py:168-176  Determine recovery path using experiment and trial names.

Recovery Logic:

  1. Check if recovery mode is enabled (areal/utils/recover.py:216-217).
  2. Validate frequency control (epochs, steps, or seconds) (areal/utils/recover.py:161-165).
  3. Serialize all metadata components via RecoverInfo.dump() (areal/utils/recover.py:52-94).
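The three steps above can be sketched as a guard around the actual dump. This is an illustration under stated assumptions, not the RecoverHandler code: `FreqControl`, `maybe_dump`, the boolean `enabled` flag, and the rule that any one exhausted budget triggers a save are all hypothetical.

```python
import time


class FreqControl:
    """Fires when any configured epoch, step, or wall-clock budget is used up."""

    def __init__(self, freq_epochs=None, freq_steps=None, freq_secs=None,
                 clock=time.monotonic):
        self.freq_epochs = freq_epochs
        self.freq_steps = freq_steps
        self.freq_secs = freq_secs
        self._clock = clock  # injectable so the timer can be tested
        self._epochs = 0
        self._steps = 0
        self._last_fire = clock()

    def check(self, epochs=0, steps=0):
        self._epochs += epochs
        self._steps += steps
        now = self._clock()
        due = (
            (self.freq_epochs is not None and self._epochs >= self.freq_epochs)
            or (self.freq_steps is not None and self._steps >= self.freq_steps)
            or (self.freq_secs is not None and now - self._last_fire >= self.freq_secs)
        )
        if due:
            # Reset every counter so the next interval starts from zero.
            self._epochs = 0
            self._steps = 0
            self._last_fire = now
        return due


def maybe_dump(enabled, freq_ctl, dump_fn, epochs=0, steps=0):
    """Step 1: mode check. Step 2: frequency check. Step 3: serialize."""
    if not enabled:
        return False
    if not freq_ctl.check(epochs=epochs, steps=steps):
        return False
    dump_fn()
    return True


# Dump every third training step; the clock is frozen so only steps matter.
dumps = []
ctl = FreqControl(freq_steps=3, clock=lambda: 0.0)
results = [maybe_dump(True, ctl, lambda: dumps.append("saved"), steps=1) for _ in range(4)]
```

Injecting the clock keeps the time-based branch deterministic in tests while defaulting to a monotonic clock in normal use.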

Sources: areal/utils/recover.py:151-310


Async Checkpointing

AReaL supports asynchronous checkpointing for the ArchonEngine to minimize training stalls.

Async Mode Management

The Saver class manages AsyncCheckpointManager instances to handle background saving.

Workflow:

  1. Trigger: Saver.save() checks if the engine supports async mode (areal/utils/saver.py:122-148).
  2. Staging: Weights are staged to a buffer. maybe_wait_for_staging() is called before the next update to ensure the buffer is free (areal/utils/saver.py:182-186).
  3. Background Save: save_model_to_hf is invoked with an async_mgr to perform I/O in a separate thread (areal/utils/saver.py:173-180).
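The staging-then-background-write pattern in this workflow can be sketched with the standard library. This is not the AsyncCheckpointManager implementation: `AsyncSaverSketch`, `save_async`, and `wait_for_save` are hypothetical names, a deep copy stands in for staging weights to a buffer, and only the name `maybe_wait_for_staging` is borrowed from the text.

```python
import copy
import threading


class AsyncSaverSketch:
    """Stages a snapshot of the weights, then writes it on a background thread."""

    def __init__(self, write_fn):
        self._write_fn = write_fn  # slow I/O callback, e.g. a file writer
        self._staging_done = threading.Event()
        self._staging_done.set()  # nothing is being staged initially
        self._thread = None

    def save_async(self, weights):
        self.wait_for_save()  # allow only one in-flight save at a time
        self._staging_done.clear()
        self._thread = threading.Thread(target=self._run, args=(weights,))
        self._thread.start()

    def _run(self, weights):
        try:
            staged = copy.deepcopy(weights)  # staging: snapshot the live weights
        finally:
            self._staging_done.set()  # never leave the trainer blocked
        self._write_fn(staged)  # background I/O works on the snapshot only

    def maybe_wait_for_staging(self):
        """Block until the snapshot is complete; call before mutating weights."""
        self._staging_done.wait()

    def wait_for_save(self):
        """Block until the background write has finished."""
        if self._thread is not None:
            self._thread.join()
            self._thread = None


written = []
saver = AsyncSaverSketch(write_fn=written.append)
weights = {"w": [1.0, 2.0]}
saver.save_async(weights)
saver.maybe_wait_for_staging()  # after this, the live weights may change
weights["w"][0] = 99.0  # stands in for the next optimizer update
saver.wait_for_save()
```

The training loop only stalls for the duration of the copy, not the write, which is the point of separating staging from I/O.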

Sources: areal/utils/saver.py:98-186


FP8 Checkpoint Support

AReaL provides specialized support for loading FP8-quantized checkpoints, particularly for ArchonEngine.

FP8 Detection and Preparation

The system includes heuristics to detect blockwise FP8 checkpoints (e.g., DeepSeek-V3 or Qwen3-FP8) by searching for *_scale_inv keys in the index file (areal/experimental/models/archon/fp8_checkpoint.py:34-44). Before loading via DCP, _prepare_fp8_state_dict mutates placeholders to float8_e4m3fn and inserts float32 placeholders for scales (areal/experimental/models/archon/fp8_checkpoint.py:47-116).
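The detection heuristic can be sketched as a scan over the tensor names in a checkpoint index. This is a minimal illustration, not the code in fp8_checkpoint.py: the function name is hypothetical, and it assumes the index is a HuggingFace-style JSON file whose `weight_map` maps tensor names to shard files.

```python
import json
import tempfile
from pathlib import Path


def looks_like_blockwise_fp8(index_path):
    """Return True if any tensor name in the index ends with '_scale_inv'."""
    index = json.loads(Path(index_path).read_text())
    weight_map = index.get("weight_map", {})  # assumed index layout
    return any(name.endswith("_scale_inv") for name in weight_map)


# Two tiny index files: one with a scale tensor next to the weight, one without.
with tempfile.TemporaryDirectory() as tmp:
    fp8_index = Path(tmp) / "fp8.index.json"
    fp8_index.write_text(json.dumps({"weight_map": {
        "layers.0.mlp.weight": "model-00001.safetensors",
        "layers.0.mlp.weight_scale_inv": "model-00001.safetensors",
    }}))
    plain_index = Path(tmp) / "plain.index.json"
    plain_index.write_text(json.dumps({"weight_map": {
        "layers.0.mlp.weight": "model-00001.safetensors",
    }}))
    fp8_detected = looks_like_blockwise_fp8(fp8_index)
    plain_detected = looks_like_blockwise_fp8(plain_index)
```

Reading only the index keeps the check cheap, since no tensor data has to be opened to decide which loading path to take.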

Dequantization Strategies

Sources: areal/experimental/models/archon/fp8_checkpoint.py:34-197