This is a model released for our paper: REBEL: Reinforcement Learning via Regressing Relative Rewards.
REBEL-Llama-3-Armo-iter_1
This model was trained with REBEL, starting from Meta-Llama-3-8B-Instruct, with ArmoRM-Llama3-8B-v0.1 as the reward model and UltraFeedback as the dataset. The training code is available at https://github.com/ZhaolinGao/REBEL. We collect offline generations for the entire dataset, taking the best-of-5 generation as the chosen response and the worst-of-5 generation as the rejected response (Ultrafeedback-Llama-3-Armo-iter_1).
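The best-of-5 / worst-of-5 selection described above can be illustrated with a minimal sketch. It assumes that each prompt already has five generated responses and one scalar reward per response (for example, from the reward model). The names `PreferencePair` and `build_pair` are hypothetical and are not taken from the REBEL repository; the actual data pipeline is in the training code linked above.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class PreferencePair:
    """One chosen/rejected pair for a single prompt (hypothetical container)."""
    prompt: str
    chosen: str
    rejected: str
    chosen_reward: float
    rejected_reward: float


def build_pair(prompt: str, responses: Sequence[str], rewards: Sequence[float]) -> PreferencePair:
    """Pick the highest-reward response as chosen and the lowest as rejected.

    Assumption: `rewards[i]` is the reward-model score of `responses[i]`.
    If all rewards are equal, chosen and rejected are the same response.
    """
    if len(responses) != len(rewards):
        raise ValueError("responses and rewards must have the same length")
    if len(responses) < 2:
        raise ValueError("need at least two responses to form a pair")

    indices = range(len(rewards))
    best = max(indices, key=lambda i: rewards[i])   # best-of-n index
    worst = min(indices, key=lambda i: rewards[i])  # worst-of-n index

    return PreferencePair(
        prompt=prompt,
        chosen=responses[best],
        rejected=responses[worst],
        chosen_reward=rewards[best],
        rejected_reward=rewards[worst],
    )


if __name__ == "__main__":
    # Toy example with made-up responses and rewards (n = 5).
    pair = build_pair(
        "What is 2 + 2?",
        ["4", "The answer is 4.", "5", "I don't know.", "2 + 2 = 4"],
        [0.6, 0.9, -0.4, -0.1, 0.8],
    )
    print(pair.chosen, "|", pair.rejected)
```

Keeping the two reward values alongside the pair is a design choice in this sketch: a method that regresses relative rewards needs the reward difference between the two responses, not only their ordering.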
Links to Other Models
Evaluations
| Model | AlpacaEval 2.0 LC Win Rate | AlpacaEval 2.0 Win Rate | MT-Bench Average | MMLU (5-shot) | GSM8K (5-shot) |
|---|---|---|---|---|---|
| REBEL-OpenChat-3.5 | 17.3 | 12.8 | 8.06 | 63.7 | 68.8 |
| REBEL-Llama-3 | 30.1 | 32.6 | 8.16 | 65.8 | 75.6 |
| REBEL-Llama-3-epoch_2 | 31.3 | 34.2 | 7.83 | 65.4 | 75.4 |
| REBEL-Llama-3-Armo-iter_1 | 48.3 | 41.8 | 8.13 | 66.3 | 75.8 |
| REBEL-Llama-3-Armo-iter_2 | 50.0 | 48.5 | 8.07 | 65.9 | 75.4 |
| REBEL-Llama-3-Armo-iter_3 | 49.7 | 48.1 | 8.01 | 66.0 | 75.7 |
Citation
Please cite our paper if you use this model in your own work:
@misc{gao2024rebel,
title={REBEL: Reinforcement Learning via Regressing Relative Rewards},
author={Zhaolin Gao and Jonathan D. Chang and Wenhao Zhan and Owen Oertell and Gokul Swamy and Kianté Brantley and Thorsten Joachims and J. Andrew Bagnell and Jason D. Lee and Wen Sun},
year={2024},
eprint={2404.16767},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
Model size: 8B params (Safetensors)
Tensor type: BF16
Model tree for Cornell-AGI/REBEL-Llama-3-Armo-iter_1
Base model
meta-llama/Meta-Llama-3-8B-Instruct