Last indexed: 7 May 2026 (2e12c1)

DPO Implementation

Direct Preference Optimization (DPO) in AReaL provides a stable and efficient alternative to reward-model-based RLHF: it optimizes a policy directly on preference data, without a separate reward model or an online reinforcement learning loop.

Architecture and Overview

The DPO implementation in AReaL follows a contrastive learning paradigm where a trainable policy (the "Actor") is optimized to assign higher log-probabilities to preferred ("chosen") responses than to "rejected" responses, relative to a frozen reference model.

Key Components

Component	Role	Code Entity (source location)
DPOTrainer	Orchestrates the high-level training loop, data loading, and evaluation.	`DPOTrainer` (areal/trainer/dpo_trainer.py:84)
DPOEngine	Handles the low-level compute logic for forward/backward passes on workers.	`DPOEngine` (areal/trainer/dpo/dpo_engine.py:19)
DPOController	Manages distributed RPC calls and data dispatching to workers.	`DPOController` (areal/trainer/dpo/dpo_engine.py:112)
Reference Model	Frozen model used to compute baseline log-probabilities.	`self.ref` (areal/trainer/dpo_trainer.py:115)

Data Flow and Code Entity Mapping

The following diagram illustrates how the DPO training process bridges high-level algorithmic concepts to specific code entities within the AReaL framework.

[Diagram: DPO Training Data Flow]

Sources: areal/trainer/dpo_trainer.py:84-147, areal/trainer/dpo/dpo_engine.py:19-112, areal/trainer/dpo_trainer.py:54-81, areal/dataset/hhrlhf.py:33-78

Algorithmic Implementation

AReaL supports two primary preference optimization variants: the original DPO (Bradley-Terry based) and IPO (Identity Preference Optimization).

Loss Functions

The core loss computation occurs in compute_dpo_loss. It processes log-probabilities from both the actor and reference models for both chosen and rejected completions.

Sigmoid (DPO): Uses the standard sigmoid-based contrastive loss with a Bradley-Terry model assumption (areal/trainer/dpo/dpo_engine.py:193-195)
IPO: A variant that optimizes the log-likelihood ratio directly with a squared loss, providing stronger regularization (areal/trainer/dpo/dpo_engine.py:196-198)
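The two variants above can be sketched per preference pair as follows. This is a minimal illustration based on the published DPO and IPO objectives, not a copy of AReaL's compute_dpo_loss. The function names, the `beta` parameter, and the use of scalar, already-summed sequence log-probabilities are all assumptions made for the example.

```python
import math


def log_ratio_margin(pi_chosen, pi_rejected, ref_chosen, ref_rejected):
    """Return (chosen log-ratio) minus (rejected log-ratio).

    Each argument is a summed sequence log-probability (a float);
    `pi_*` come from the actor and `ref_*` from the frozen reference.
    """
    return (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)


def sigmoid_dpo_loss(margin, beta=0.1):
    """Sigmoid (Bradley-Terry) DPO loss: -log(sigmoid(beta * margin))."""
    x = beta * margin
    # Numerically stable softplus(-x) = log(1 + exp(-x)).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def ipo_loss(margin, beta=0.1):
    """IPO loss: squared distance of the margin from the target 1 / (2 * beta)."""
    return (margin - 1.0 / (2.0 * beta)) ** 2
```

The sigmoid loss keeps decreasing as the margin grows, whereas the squared IPO loss penalizes margins on either side of its fixed target. That bounded target is the usual explanation for IPO's stronger regularization.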

[Diagram: Loss Computation Logic]

Sources: areal/trainer/dpo/dpo_engine.py:151-224

Trainer Orchestration

The DPOTrainer manages the lifecycle of the training process. Unlike PPO, DPO in AReaL is typically used in an offline fashion, consuming pre-existing preference datasets.

Initialization and Colocation

During initialization, the trainer creates both an actor and a reference engine.

Actor: The model being optimized (areal/trainer/dpo_trainer.py:112)
Reference: A frozen copy of the model, often the SFT checkpoint (areal/trainer/dpo_trainer.py:115)

The reference model can be configured to offload to CPU to save GPU memory when not in use (areal/trainer/dpo_trainer.py:116)

Training Loop

The main training loop in DPOTrainer.train() performs the following for each step:

Data Loading: Fetches a batch of preference pairs from the StatefulDataLoader (areal/trainer/dpo_trainer.py:194)
Dispatch: The DPOController partitions the batch across data-parallel ranks using _dispatch_tensors with group_size=2, so that each chosen/rejected pair stays together (areal/infra/controller/train_controller.py:76-123)
Engine Execution: Calls train_dpo on the workers. The DPOEngine sequences the actor pass (forward and backward) and the reference model pass (forward only) (areal/trainer/dpo/dpo_engine.py:33-66)
Logging: Collects metrics including reward_accuracy, reward_margin, and KL divergence (areal/trainer/dpo/dpo_engine.py:205-215)
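The logged reward_accuracy and reward_margin can be illustrated with a short sketch. It assumes the definitions commonly used for DPO (the fraction of pairs whose chosen implicit reward exceeds the rejected one, and the mean difference between them). Whether AReaL computes them exactly this way is not stated here, and the function name `preference_metrics` is hypothetical.

```python
def preference_metrics(chosen_rewards, rejected_rewards):
    """Compute batch-level reward_accuracy and reward_margin.

    Inputs are equal-length lists of implicit rewards, where each reward is
    assumed to be beta * (actor log-prob - reference log-prob) for one response.
    """
    if len(chosen_rewards) != len(rejected_rewards) or not chosen_rewards:
        raise ValueError("expected two non-empty lists of equal length")
    pairs = list(zip(chosen_rewards, rejected_rewards))
    # Fraction of pairs where the chosen response is ranked above the rejected one.
    accuracy = sum(c > r for c, r in pairs) / len(pairs)
    # Mean gap between chosen and rejected implicit rewards.
    margin = sum(c - r for c, r in pairs) / len(pairs)
    return {"reward_accuracy": accuracy, "reward_margin": margin}
```

Under these definitions, a reward_accuracy near 0.5 with a margin near zero would suggest the actor has not yet separated chosen from rejected responses.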

Sources: areal/trainer/dpo_trainer.py:191-230, areal/trainer/dpo/dpo_engine.py:33-66, areal/infra/controller/train_controller.py:76-123

Preference Dataset Handling

DPO requires datasets structured with chosen and rejected responses for a given prompt.

RDataset and Data Service

For large-scale training, AReaL uses the RDataset and DataController to stream preference pairs (areal/trainer/dpo_trainer.py:118-131)

Mapping: The dataset must provide chosen_ids and rejected_ids (areal/dataset/hhrlhf.py:63-64)
Collation: The dpo_modeling_collate_fn transforms dataset items into pairs of dicts containing input_ids, attention_mask, and loss_mask (areal/trainer/dpo_trainer.py:54-81)
Dispatch Logic: In DPOController, the group_size is set to 2 during dispatch to ensure that a chosen response and its corresponding rejected response are always processed on the same data-parallel rank (areal/trainer/dpo/dpo_engine.py:121)
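The collation and grouped dispatch steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the code of dpo_modeling_collate_fn or _dispatch_tensors: the `prompt_len` field, the assumption that chosen_ids and rejected_ids include the prompt tokens, and the round-robin assignment policy are all hypothetical.

```python
def collate_pairs(items):
    """Flatten dataset items into [chosen_0, rejected_0, chosen_1, ...]."""
    batch = []
    for item in items:
        prompt_len = item["prompt_len"]  # hypothetical field, not from AReaL
        for key in ("chosen_ids", "rejected_ids"):
            ids = item[key]
            batch.append({
                "input_ids": list(ids),
                "attention_mask": [1] * len(ids),
                # Assumed convention: train only on response tokens.
                "loss_mask": [0] * prompt_len + [1] * (len(ids) - prompt_len),
            })
    return batch


def dispatch_grouped(batch, num_ranks, group_size=2):
    """Assign whole groups of `group_size` sequences to ranks, round-robin."""
    if len(batch) % group_size != 0:
        raise ValueError("batch length must be a multiple of group_size")
    shards = [[] for _ in range(num_ranks)]
    for g, start in enumerate(range(0, len(batch), group_size)):
        # A group is never split, so each pair lands on a single rank.
        shards[g % num_ranks].extend(batch[start:start + group_size])
    return shards
```

Because the loss needs both log-probabilities of a pair, splitting a pair across ranks would force cross-rank communication. Dispatching in groups of two avoids that.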

Implementation Example: Reward Model Training

The same preference data handling logic is shared with RWTrainer (Reward Model Trainer), which uses a Bradley-Terry ranking loss (areal/trainer/rw_trainer.py:54-73)
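For reference, the textbook Bradley-Terry ranking loss for a single pair is sketched below. The function name and the scalar score inputs are assumptions for illustration; the sketch does not reproduce RWTrainer's code.

```python
import math


def bradley_terry_loss(chosen_score, rejected_score):
    """Return -log(sigmoid(chosen_score - rejected_score)) for one pair."""
    diff = chosen_score - rejected_score
    # Numerically stable softplus(-diff) = log(1 + exp(-diff)).
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
```

The loss shrinks as the reward model scores the chosen response further above the rejected one, and equals log(2) when the two scores tie.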

[Diagram: Preference Data Entity Mapping]

Sources: areal/trainer/dpo_trainer.py:54-81, areal/infra/controller/train_controller.py:76-93, areal/trainer/dpo/dpo_engine.py:121-125, areal/dataset/hhrlhf.py:33-78


URL: https://deepwiki.com/inclusionAI/AReaL/7.8-dpo-implementation