WorldPM-72B-RLHFLow
Introduction
WorldPM (World Preference Modeling) demonstrates that preference modeling follows scaling laws similar to those of language modeling. Through large-scale training on 15M preference samples, we show that preference models can learn unified preference representations.
Model Usage
Base Model and Fine-tuned Variants
WorldPM learns a unified preference representation through large-scale training. Although our experiments show strong generalization across a variety of preference scenarios, we recommend task-specific fine-tuning for optimal performance.
Base Model
- WorldPM-72B
Fine-tuned Versions
Each model is fine-tuned on human preference datasets of varying sizes:
| Model | Dataset | Training Scale |
|---|---|---|
| WorldPM-72B-HelpSteer2 | HelpSteer2 | 7K |
| WorldPM-72B-UltraFeedback | UltraFeedback | 100K |
| WorldPM-72B-RLHFLow | RLHFLow | 800K |
Recommendation
The base WorldPM-72B model is a strong starting point for custom fine-tuning. Our experiments confirm that starting from WorldPM yields better performance than training from scratch.
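For readers planning such fine-tuning, the pairwise objective commonly used for preference models can be sketched as follows. This is a minimal illustration that assumes a Bradley-Terry style loss over scalar scores; the model card does not state the training objective, and `bradley_terry_loss` is a hypothetical helper name.

```python
import math


def bradley_terry_loss(chosen_score: float, rejected_score: float) -> float:
    """Pairwise loss -log(sigmoid(chosen - rejected)), computed stably.

    Assumption: this objective is illustrative and not taken from the model card.
    """
    margin = chosen_score - rejected_score
    # -log(sigmoid(m)) equals softplus(-m) = max(-m, 0) + log1p(exp(-|m|)),
    # which avoids overflow for large positive or negative margins.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The loss shrinks as the chosen response is scored further above the rejected one, and grows roughly linearly when the ordering is reversed.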
Requirements
Quick Start
Key Notes
- The system prompt remains empty during training (no preset preferences)
- The system marker is preserved
- Reward computation uses the hidden state of the <|endoftext|> token
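The last note can be illustrated with a toy sketch: take the hidden vector at the position of the <|endoftext|> token and project it to a scalar. Everything here is an assumption made for illustration (plain Python lists instead of tensors, a single linear head, and the hypothetical names `reward_from_hidden_states`, `head_weights` and `head_bias`); the actual head is defined by the model's own code.

```python
from typing import Sequence


def reward_from_hidden_states(
    hidden_states: Sequence[Sequence[float]],  # one hidden vector per token
    token_ids: Sequence[int],
    eot_token_id: int,
    head_weights: Sequence[float],
    head_bias: float = 0.0,
) -> float:
    """Project the hidden state at the last end-of-text token to a scalar."""
    if len(hidden_states) != len(token_ids):
        raise ValueError("hidden_states and token_ids must have the same length")
    # Locate the last occurrence of the end-of-text token.
    positions = [i for i, tok in enumerate(token_ids) if tok == eot_token_id]
    if not positions:
        raise ValueError("sequence has no end-of-text token")
    pooled = hidden_states[positions[-1]]
    # Hypothetical linear head: dot product plus bias.
    return sum(w * h for w, h in zip(head_weights, pooled)) + head_bias


# Toy input: 3 tokens, hidden size 2; token id 9 stands in for <|endoftext|>.
toy_hidden = [[0.1, 0.2], [0.3, 0.4], [1.0, -2.0]]
toy_score = reward_from_hidden_states(toy_hidden, [5, 7, 9], 9, [0.5, 0.25])
```

Only the vector at the end-of-text position contributes to the score, so the rendered conversation needs to end with that token for the pooling to be meaningful.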
Usage Example with Hugging Face
import torch
from transformers import AutoModel, AutoTokenizer, AutoConfig

model_name = 'Qwen/WorldPM-72B-RLHFLow'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    config=config,
    device_map="auto",
    trust_remote_code=True,
).eval()


def get_score(model, tokenizer, conversation):
    # Render the conversation with the chat template, without a generation prompt
    con_str = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=False)
    input_ids = tokenizer.encode(con_str, return_tensors="pt", add_special_tokens=False)
    # Inference only, so gradients are not needed
    with torch.no_grad():
        outputs = model(input_ids=input_ids.to(model.device))
    score = outputs[0].cpu().item()
    return score


# Two conversations with the same prompt; the system prompt is left empty
messages = [
    [
        {'role': 'system', 'content': ""},
        {'role': 'user', 'content': "Tips for better sleep?"},
        {'role': 'assistant', 'content': "Three key tips for better sleep:\n1. Set a consistent bedtime and wake-up schedule\n2. Avoid screens 1 hour before bed\n3. Keep your bedroom cool and dark\n\nStart with these habits and you should notice improvement within a week."}
    ],
    [
        {'role': 'system', 'content': ""},
        {'role': 'user', 'content': "Tips for better sleep?"},
        {'role': 'assistant', 'content': "just try sleeping earlier or take some pills if u cant sleep lol. maybe watch tv until u get tired idk"}
    ]
]

score = [get_score(model, tokenizer=tokenizer, conversation=message) for message in messages]
print(score)
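Once scores are available, candidate responses to the same prompt can be ranked by them. The helper below is a small sketch that assumes a higher score means a more preferred response; `rank_by_score` is a hypothetical name, and the numbers are placeholders rather than real model outputs.

```python
from typing import List, Sequence


def rank_by_score(scores: Sequence[float]) -> List[int]:
    """Return candidate indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Placeholder scores for illustration only, not real model outputs.
example_scores = [-1.2, 0.8, 0.3]
ranking = rank_by_score(example_scores)  # candidate indices, best first
best_index = ranking[0]
```

With the list returned by the usage example, `rank_by_score(score)[0]` would give the index of the conversation the model scores highest.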
Citation
@article{WorldPM,
  title={WorldPM: Scaling Human Preference Modeling},
  author={Binghai Wang and Runji Lin and Keming Lu and Le Yu and Zhenru Zhang and Fei Huang and Chujie Zheng and Kai Dang and Yang Fan and Xingzhang Ren and An Yang and Dayiheng Liu and Tao Gui and Qi Zhang and Xuanjing Huang and Yu-Gang Jiang and Bowen Yu and Jingren Zhou and Junyang Lin},
  journal={arXiv preprint arXiv:2505.10527},
  year={2025}
}
Community & Support
We welcome discussions and feedback from the community! Here's how you can reach out:
- Open an issue on GitHub for bug reports or feature requests
- Share your ideas and questions in GitHub Discussions
- Contact the authors directly
Feel free to engage with us through any of these channels. We value your input and look forward to hearing from you!
