Last indexed: 7 May 2026 (2e12c1)

Reward Model Training

Reward model training is the phase of the RLHF pipeline in which a model learns to predict human preferences. The codebase and the rest of this page abbreviate it as RW (for example RWConfig and RWTrainer). In AReaL, this is typically achieved by training a regression or classification head on top of a transformer backbone using preference pairs (chosen vs. rejected responses) and the Bradley-Terry objective.
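As a rough illustration of the Bradley-Terry objective mentioned above, the sketch below computes the pairwise loss -log(sigmoid(r_chosen - r_rejected)) for two scalar scores. It is a framework-free sketch, not AReaL's implementation: the function name bradley_terry_loss is hypothetical, and AReaL's own loss code is not shown on this page.

```python
import math


def bradley_terry_loss(chosen_score: float, rejected_score: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one.

    Hypothetical helper for illustration only; not part of AReaL.
    """
    margin = chosen_score - rejected_score
    # -log(sigmoid(margin)) equals softplus(-margin). The form below avoids
    # overflow in exp() for large positive or negative margins.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def mean_pairwise_loss(pairs: list[tuple[float, float]]) -> float:
    """Average the pairwise loss over (chosen_score, rejected_score) tuples."""
    return sum(bradley_terry_loss(c, r) for c, r in pairs) / len(pairs)
```

The loss shrinks as the chosen score exceeds the rejected score by a wider margin, and equals log(2) when the two scores are tied.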

Training Entry Point

The primary entry point for Reward Model training is the script examples/alignment/hhrlhf_rw.py. This script orchestrates the configuration loading, dataset preparation, and trainer initialization.

Configuration: It uses RWConfig (areal/api/cli_args.py:15) to define the training parameters and load_expr_config to parse CLI arguments (examples/alignment/hhrlhf_rw.py:10).
Tokenizer: The tokenizer is loaded via load_hf_tokenizer using the path specified in the configuration (examples/alignment/hhrlhf_rw.py:12).
Dataset Setup: It calls get_custom_dataset (examples/alignment/hhrlhf_rw.py:14) for both training and validation splits, passing the specific dataset_config (examples/alignment/hhrlhf_rw.py:16-21).
Trainer Execution: The RWTrainer context manager is initialized, and trainer.train() is called to begin the optimization loop (examples/alignment/hhrlhf_rw.py:25-28).

Preference Data Processing

Reward model training depends on preference datasets being converted into tokenized chosen/rejected pairs. AReaL provides specialized utilities for datasets such as HH-RLHF (Anthropic's Helpful and Harmless dataset).

HH-RLHF Dataset Implementation

The get_hhrlhf_rw_dataset function (areal/dataset/hhrlhf.py:6) handles the preprocessing specific to reward modeling:

Tokenization: It encodes the "chosen" and "rejected" strings, appending the tokenizer.eos_token to each (areal/dataset/hhrlhf.py:15-16).
Filtering: If a max_length is provided, samples where either the chosen or rejected sequence exceeds the limit are filtered out (areal/dataset/hhrlhf.py:21-28).
Mapping: The raw text columns are removed, leaving only chosen_ids and rejected_ids (areal/dataset/hhrlhf.py:19).
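The three steps above can be sketched in plain Python. This is an illustrative approximation only: preprocess_pairs and toy_encode are hypothetical names, the character-level encoder stands in for a real Hugging Face tokenizer, and the actual get_hhrlhf_rw_dataset operates on a dataset object rather than a list of dicts.

```python
from typing import Callable, Optional


def toy_encode(text: str) -> list[int]:
    """Stand-in tokenizer (assumption): one token id per character."""
    return [ord(ch) for ch in text]


def preprocess_pairs(
    samples: list[dict],
    encode: Callable[[str], list[int]],
    eos_token: str,
    max_length: Optional[int] = None,
) -> list[dict]:
    """Tokenize chosen/rejected text, drop over-long pairs, keep only the ids."""
    processed = []
    for sample in samples:
        # Tokenization: append the EOS token to each response before encoding.
        chosen_ids = encode(sample["chosen"] + eos_token)
        rejected_ids = encode(sample["rejected"] + eos_token)
        # Filtering: skip the pair if either side exceeds max_length.
        if max_length is not None and (
            len(chosen_ids) > max_length or len(rejected_ids) > max_length
        ):
            continue
        # Mapping: the raw text columns are not carried over.
        processed.append({"chosen_ids": chosen_ids, "rejected_ids": rejected_ids})
    return processed
```

Note that a pair is dropped as a whole when either side is too long, so the chosen and rejected sequences always stay aligned.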

Collation Logic

For reward modeling, data must be formatted into pairs. The rw_modeling_collate_fn function (areal/trainer/rw_trainer.py:54) handles this transformation:

Input: Items containing chosen_ids and rejected_ids (areal/trainer/rw_trainer.py:57-60).
Output: A list of dictionaries where each pair is expanded into two sequential entries (chosen first, then rejected) (areal/trainer/rw_trainer.py:61-64).
Tensors: Each entry contains input_ids and an attention_mask (areal/trainer/rw_trainer.py:69-72).
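The pair expansion described above might look roughly like the following. This is a simplified sketch, not the real rw_modeling_collate_fn: the name collate_pairs is hypothetical, and plain Python lists are used where the actual function produces tensors.

```python
def collate_pairs(items: list[dict]) -> list[dict]:
    """Expand each preference pair into two consecutive entries.

    Hypothetical simplification: lists stand in for tensors.
    """
    batch = []
    for item in items:
        # Chosen first, then rejected, so the two stay adjacent in the output.
        for key in ("chosen_ids", "rejected_ids"):
            input_ids = list(item[key])
            batch.append(
                {
                    "input_ids": input_ids,
                    # Every position holds a real token, so the mask is all ones.
                    "attention_mask": [1] * len(input_ids),
                }
            )
    return batch
```

With this layout, entries at even positions (0, 2, ...) are chosen responses and entries at odd positions are the matching rejected responses, which lets a pairwise loss recover each pair by position.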

Data Flow: From Dataset to Reward Model

The following diagram illustrates how preference data is transformed from raw text into tokenized IDs suitable for Bradley-Terry loss computation within the training engines.

Natural Language to Code Entity Mapping: Preference Data

Natural Language Concept	Code Entity / Function	File Path (line)
HH-RLHF Loader	`get_hhrlhf_rw_dataset`	areal/dataset/hhrlhf.py:6
Preference Pair	`chosen_ids`, `rejected_ids`	areal/dataset/hhrlhf.py:17
Collation Function	`rw_modeling_collate_fn`	areal/trainer/rw_trainer.py:54
Training Engine	`FSDPRWEngine` / `MegatronRWEngine`	areal/trainer/rw_trainer.py:47

Figure: Preference Data Transformation Flow. This diagram tracks the flow of a single preference sample through the preprocessing pipeline into the distributed engines.

Comparison with DPO

Reward Model (RW) training produces a model that assigns a scalar score to each response, whereas Direct Preference Optimization (DPO) uses preference signals to align a policy directly. AReaL provides separate dataset loaders for the two tasks.

Feature	Reward Model (RW)	Direct Preference Optimization (DPO)
Dataset Function	`get_hhrlhf_rw_dataset`	`get_hhrlhf_dpo_dataset`
Data Fields	`chosen_ids`, `rejected_ids`	`chosen_ids`, `rejected_ids`, `chosen_loss_mask`, `rejected_loss_mask`
Collate Function	`rw_modeling_collate_fn`	`dpo_modeling_collate_fn`
Trainer Class	`RWTrainer`	`DPOTrainer`
Loss Masking	Full sequence (attention mask only)	Response-only loss masking

The get_hhrlhf_dpo_dataset function (areal/dataset/hhrlhf.py:33) calculates a prompt_len by comparing sequences and uses it to generate loss_mask arrays (areal/dataset/hhrlhf.py:55-68), whereas RW training typically uses the full sequence.
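One plausible way to derive response-only loss masks is sketched below. It assumes that prompt_len is the length of the token prefix shared by the chosen and rejected sequences; the page only says the sequences are compared, so this is an assumption, and the helper names common_prefix_len and response_loss_masks are hypothetical.

```python
def common_prefix_len(a: list[int], b: list[int]) -> int:
    """Count the leading tokens that two sequences have in common."""
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return length


def response_loss_masks(
    chosen_ids: list[int], rejected_ids: list[int]
) -> tuple[int, list[int], list[int]]:
    """Return prompt_len and 0/1 loss masks that cover only the responses.

    Assumption: the shared prefix of the two sequences is the prompt.
    """
    prompt_len = common_prefix_len(chosen_ids, rejected_ids)
    # 0 for prompt tokens (excluded from the loss), 1 for response tokens.
    chosen_mask = [0] * prompt_len + [1] * (len(chosen_ids) - prompt_len)
    rejected_mask = [0] * prompt_len + [1] * (len(rejected_ids) - prompt_len)
    return prompt_len, chosen_mask, rejected_mask
```

Under this assumption, the RW path would skip the step entirely and score every token position in the sequence.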

Training Orchestration

The RWTrainer (areal/trainer/rw_trainer.py:76) orchestrates the high-level training loop. It delegates distributed execution of reward training to backend-specific engines (FSDPRWEngine or MegatronRWEngine).

Figure: Training Execution Loop. The interaction between dataset loaders, the RWTrainer, and the underlying training engines.

URL: https://deepwiki.com/inclusionAI/AReaL/14.9-reward-model-training