WorldPM-72B-RLHFLow 🌍

📚 Introduction

📄 WorldPM (World Preference Modeling) demonstrates that preference modeling follows similar scaling laws as language modeling. Through large-scale training on 15M preference data, we reveal that preference models can learn unified preference representations.

👁 main-loss

🎯 Model Usage

Base Model and Fine-tuned Variants

WorldPM represents a breakthrough in unified preference representation learning through large-scale training. While our experiments demonstrate strong generalization capabilities across various preference scenarios, we recommend task-specific fine-tuning for optimal performance.

Base Model

🌟 WorldPM-72B

Fine-tuned Versions

Each model is fine-tuned on human preference datasets of varying sizes:

Model	Dataset	Training Scale
WorldPM-72B-HelpSteer2	HelpSteer2	7K
WorldPM-72B-UltraFeedback	UltraFeedback	100K
WorldPM-72B-RLHFLow	RLHFLow	800K

💡 Recommendation

The base WorldPM-72B model serves as an excellent starting point for custom fine-tuning. Our experiments confirm that starting from WorldPM leads to better performance compared to training from scratch.

🛠️ Requirements

👁 transformers

🚀 Quick Start

📋 Key Notes

System prompt remains empty during training (no preset preferences)
System marker is preserved
Reward computation uses the hidden state of <|endoftext|> token

💻 Usage Example with Hugging Face

from transformers import AutoModel, AutoTokenizer, AutoConfig

model_name = 'Qwen/WorldPM-72B-RLHFLow'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
config = AutoConfig.from_pretrained(model_name,trust_remote_code=True )
model = AutoModel.from_pretrained(
 model_name, 
 config = config, 
 device_map = "auto", 
 trust_remote_code=True,
).eval()


def get_score(model, tokenizer, conversation):
 con_str = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=False)
 input_ids = tokenizer.encode(con_str, return_tensors="pt", add_special_tokens=False)
 outputs = model(input_ids=input_ids.to(model.device))
 score = outputs[0].cpu().item()
 return score


messages = [
 [
 {'role': 'system', 'content': ""},
 {'role': 'user', 'content': "Tips for better sleep?"},
 {'role': 'assistant', 'content': "Three key tips for better sleep:\n1. Set a consistent bedtime and wake-up schedule\n2. Avoid screens 1 hour before bed\n3. Keep your bedroom cool and dark\n\nStart with these habits and you should notice improvement within a week."}
 ],
 [
 {'role': 'system', 'content': ""},
 {'role': 'user', 'content': "Tips for better sleep?"},
 {'role': 'assistant', 'content': "just try sleeping earlier or take some pills if u cant sleep lol. maybe watch tv until u get tired idk"}
 ]
]

score = [get_score(model, tokenizer=tokenizer, conversation=message) for message in messages]

print(score)

📝 Citation

@article{WorldPM,
 title={WorldPM:Scaling Human Preference Modeling}, 
 author={Binghai Wang, Runji Lin, Keming Lu, Le Yu, Zhenru Zhang, Fei Huang, Chujie Zheng, Kai Dang, Yang Fan, Xingzhang Ren, An Yang, Dayiheng Liu, Tao Gui, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang, Bowen Yu, Jingren Zhou, and Junyang Lin},
 journal={arXiv preprint arXiv:2505.10527},
 year={2025}
}

🤝 Community & Support

We welcome discussions and feedback from the community! Here's how you can reach out:

📝 Open an issue on GitHub for bug reports or feature requests
💡 Share your ideas and questions in GitHub Discussions
✉️ Contact the authors directly at here

Feel free to engage with us through any of these channels. We value your input and look forward to hearing from you!

Downloads last month: 37

Safetensors

Model size

73B params

Tensor type

BF16

Model tree for Qwen/WorldPM-72B-RLHFLow

Base model

Qwen/Qwen2.5-72B

Finetuned

Qwen/WorldPM-72B

Finetuned

(4)

this model

Dataset used to train Qwen/WorldPM-72B-RLHFLow

Collection including Qwen/WorldPM-72B-RLHFLow

4 items • Updated Dec 31, 2025 • 9

Paper for Qwen/WorldPM-72B-RLHFLow

Paper • 2505.10527 • Published May 15, 2025 • 34

URL: https://huggingface.co/Qwen/WorldPM-72B-RLHFLow

⇱ Qwen/WorldPM-72B-RLHFLow · Hugging Face