VOOZH about

URL: https://huggingface.co/Qwen/WorldPM-72B-RLHFLow

โ‡ฑ Qwen/WorldPM-72B-RLHFLow ยท Hugging Face


WorldPM-72B-RLHFLow ๐ŸŒ

๐Ÿ‘ License
๐Ÿ‘ arXiv
๐Ÿ‘ GitHub
๐Ÿ‘ ModelScope

๐Ÿ“š Introduction

๐Ÿ“„ WorldPM (World Preference Modeling) demonstrates that preference modeling follows similar scaling laws as language modeling. Through large-scale training on 15M preference data, we reveal that preference models can learn unified preference representations.

๐Ÿ‘ main-loss

๐ŸŽฏ Model Usage

Base Model and Fine-tuned Variants

WorldPM represents a breakthrough in unified preference representation learning through large-scale training. While our experiments demonstrate strong generalization capabilities across various preference scenarios, we recommend task-specific fine-tuning for optimal performance.

Base Model

Fine-tuned Versions

Each model is fine-tuned on human preference datasets of varying sizes:

๐Ÿ’ก Recommendation

The base WorldPM-72B model serves as an excellent starting point for custom fine-tuning. Our experiments confirm that starting from WorldPM leads to better performance compared to training from scratch.

๐Ÿ› ๏ธ Requirements

๐Ÿ‘ transformers

๐Ÿš€ Quick Start

๐Ÿ“‹ Key Notes

  • System prompt remains empty during training (no preset preferences)
  • System marker is preserved
  • Reward computation uses the hidden state of <|endoftext|> token

๐Ÿ’ป Usage Example with Hugging Face

from transformers import AutoModel, AutoTokenizer, AutoConfig

model_name = 'Qwen/WorldPM-72B-RLHFLow'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
config = AutoConfig.from_pretrained(model_name,trust_remote_code=True )
model = AutoModel.from_pretrained(
 model_name, 
 config = config, 
 device_map = "auto", 
 trust_remote_code=True,
).eval()


def get_score(model, tokenizer, conversation):
 con_str = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=False)
 input_ids = tokenizer.encode(con_str, return_tensors="pt", add_special_tokens=False)
 outputs = model(input_ids=input_ids.to(model.device))
 score = outputs[0].cpu().item()
 return score


messages = [
 [
 {'role': 'system', 'content': ""},
 {'role': 'user', 'content': "Tips for better sleep?"},
 {'role': 'assistant', 'content': "Three key tips for better sleep:\n1. Set a consistent bedtime and wake-up schedule\n2. Avoid screens 1 hour before bed\n3. Keep your bedroom cool and dark\n\nStart with these habits and you should notice improvement within a week."}
 ],
 [
 {'role': 'system', 'content': ""},
 {'role': 'user', 'content': "Tips for better sleep?"},
 {'role': 'assistant', 'content': "just try sleeping earlier or take some pills if u cant sleep lol. maybe watch tv until u get tired idk"}
 ]
]

score = [get_score(model, tokenizer=tokenizer, conversation=message) for message in messages]

print(score) 

๐Ÿ“ Citation

@article{WorldPM,
 title={WorldPM:Scaling Human Preference Modeling}, 
 author={Binghai Wang, Runji Lin, Keming Lu, Le Yu, Zhenru Zhang, Fei Huang, Chujie Zheng, Kai Dang, Yang Fan, Xingzhang Ren, An Yang, Dayiheng Liu, Tao Gui, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang, Bowen Yu, Jingren Zhou, and Junyang Lin},
 journal={arXiv preprint arXiv:2505.10527},
 year={2025}
}

๐Ÿค Community & Support

We welcome discussions and feedback from the community! Here's how you can reach out:

  • ๐Ÿ“ Open an issue on GitHub for bug reports or feature requests
  • ๐Ÿ’ก Share your ideas and questions in GitHub Discussions
  • โœ‰๏ธ Contact the authors directly at here

Feel free to engage with us through any of these channels. We value your input and look forward to hearing from you!

Downloads last month
37
Safetensors
Model size
73B params
Tensor type
BF16
ยท

Model tree for Qwen/WorldPM-72B-RLHFLow

Base model

Qwen/Qwen2.5-72B
Finetuned
Qwen/WorldPM-72B
Finetuned
(4)
this model

Dataset used to train Qwen/WorldPM-72B-RLHFLow

Collection including Qwen/WorldPM-72B-RLHFLow

Paper for Qwen/WorldPM-72B-RLHFLow