🔥 Chat with Magpie Here!
🐦 Llama-3-8B-Magpie-Align-v0.1
Project Web: https://magpie-align.github.io/
Online Model Demo: https://huggingface.co/spaces/flydust/Chat-with-Magpie
Arxiv Technical Report: https://arxiv.org/abs/2406.08464
Codes: https://github.com/magpie-align/magpie
Model Overview
This model is an aligned version of meta-llama/Meta-Llama-3-8B. We apply the following pipeline:
- We first use Magpie-Align/Magpie-Pro-MT-300K-v0.1 dataset and perform SFT -> Magpie-Align/Llama-3-8B-Magpie-Align-SFT-v0.1
- We then perform DPO on the princeton-nlp/llama3-ultrafeedback dataset.
The overall performance is even better than the official Llama-3-8B-Instruct Model!
- Alpaca Eval 2 (vs GPT-4-Turbo-1106): 38.52 (LC), 38.47 (WR)
- Alpaca Eval 2 (vs Llama-3-8B-Instruct): 69.37 (LC), 70.05 (WR)
- Arena Hard: 32.4
- WildBench: 39.3 ((was) Best <30B Model! 🏆)
- Zero-Eval GSM: 54.62
Model Performance
We compare our Llama-3-8B-Magpie-Align with official and other open-aligned LLMs that have been fine-tuned from base models and have publicly released their training datasets. The results are as follows:
+---------------------------------------------+--------------------+--------------------+-----------------------+------------+
| Aligned Model ID | MT-Bench | Alpaca Eval 2 | Alpaca Eval 2 | Arena Hard |
| | | (GPT-4-Turbo-1106) | (Llama-3-8B-Instruct) | |
+---------------------------------------------+------+------+------+----------+---------+-----------+-----------+------------+
| | R1 | R2 | AVG | LC WR | WR | LC WR | WR | Score |
+---------------------------------------------+------+------+------+----------+---------+-----------+-----------+------------+
| meta-llama/Meta-Llama-3-8B-Instruct | 8.31 | 7.65 | 7.98 | 22.92 | 22.57 | 50 | 50 | 20.6 |
+---------------------------------------------+------+------+------+----------+---------+-----------+-----------+------------+
| princeton-nlp/Llama-3-Base-8B-SFT-DPO | 8.12 | 7.23 | 7.67 | 17.71 | 15.34 | 43.73 | 38.80 | 14.8 |
+---------------------------------------------+------+------+------+----------+---------+-----------+-----------+------------+
| NousResearch/Hermes-2-Pro-Llama-3-8B | 8.05 | 7.35 | 7.70 | 15.60 | 12.86 | 36.37 | 30.52 | 11.5 |
+---------------------------------------------+------+------+------+----------+---------+-----------+-----------+------------+
| allenai/llama-3-tulu-2-dpo-8b | 7.71 | 7.15 | 7.43 | 14.89 | 14.80 | 35.43 | 35.42 | 11.7 |
+---------------------------------------------+------+------+------+----------+---------+-----------+-----------+------------+
| cognitivecomputations/dolphin-2.9-llama3-8b | 7.97 | 6.98 | 7.47 | 12.50 | 8.79 | 32.67 | 22.80 | 8.2 |
+---------------------------------------------+------+------+------+----------+---------+-----------+-----------+------------+
| openchat/openchat-3.6-8b-20240522 | 7.83 | 7.23 | 7.53 | 17.70 | 12.53 | 41.30 | 30.79 | 6.7 |
+---------------------------------------------+------+------+------+----------+---------+-----------+-----------+------------+
| Magpie-Align/Llama-3-8B-Magpie-Align-v0.1 | 8.01 | 7.63 | 7.82 | 38.52 | 38.47 | 69.37 | 70.05 | 32.4 |
+---------------------------------------------+------+------+------+----------+---------+-----------+-----------+------------+
| Magpie-Align/Llama-3-8B-Magpie-Align-v0.2 | 7.81 | 7.64 | 7.73 | 49.86 | 51.98 | 75.17 | 78.20 | 37.5 |
+---------------------------------------------+------+------+------+----------+---------+-----------+-----------+------------+
👀 Other Information
License: Please follow Meta Llama 3 Community License.
Conversation Template: Please use Llama 3 official chat template for the best performance.
How to use it? Please check the official Llama 3 repository for detailed instructions. Simply replace the original model_id with Magpie-Align/Llama-3-8B-Magpie-Align-v0.1.
The detailed training pipeline is as follows.
Stage 1: Supervised Fine-tuning
We use Axolotl for SFT.
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- num_devices: 4
- gradient_accumulation_steps: 8
- total_train_batch_size: 32
- total_eval_batch_size: 4
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 100
- num_epochs: 2
Training results
| Training Loss | Epoch | Step | Validation Loss |
|---|---|---|---|
| 0.8807 | 0.0007 | 1 | 0.9001 |
| 0.5113 | 0.3337 | 464 | 0.5178 |
| 0.4668 | 0.6673 | 928 | 0.4792 |
| 0.4492 | 1.0010 | 1392 | 0.4582 |
| 0.3498 | 1.3205 | 1856 | 0.4575 |
| 0.3525 | 1.6542 | 2320 | 0.4555 |
Framework versions
- Transformers 4.40.2
- Pytorch 2.3.0+cu121
- Datasets 2.19.1
- Tokenizers 0.19.1
Stage 2: Direct Preference Optimization
We use alignment handbook for DPO.
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-06
- train_batch_size: 2
- eval_batch_size: 4
- seed: 42
- distributed_type: multi-GPU
- num_devices: 4
- gradient_accumulation_steps: 16
- total_train_batch_size: 128
- total_eval_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1
Training results
| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.628 | 0.2138 | 100 | 0.6641 | -0.8806 | -1.0146 | 0.6240 | 0.1340 | -362.7133 | -343.6060 | -0.7539 | -0.7528 |
| 0.6935 | 0.4275 | 200 | 0.6352 | -1.3660 | -1.6311 | 0.6545 | 0.2651 | -424.3628 | -392.1437 | -0.6649 | -0.6629 |
| 0.6376 | 0.6413 | 300 | 0.6178 | -1.3533 | -1.6413 | 0.6748 | 0.2880 | -425.3859 | -390.8818 | -0.6753 | -0.6758 |
| 0.5888 | 0.8550 | 400 | 0.6088 | -1.6321 | -1.9785 | 0.6829 | 0.3464 | -459.1051 | -418.7560 | -0.6440 | -0.6435 |
It achieves the following results on the evaluation set:
- Loss: 0.6084
- Rewards/chosen: -1.6265
- Rewards/rejected: -1.9735
- Rewards/accuracies: 0.6809
- Rewards/margins: 0.3470
- Logps/rejected: -458.6070
- Logps/chosen: -418.2021
- Logits/rejected: -0.6447
- Logits/chosen: -0.6439
Framework versions
- Transformers 4.41.2
- Pytorch 2.3.1+cu121
- Datasets 2.20.0
- Tokenizers 0.19.1
Downstream Performance
| Datasets | Llama-3-8B-Magpie-Align-v0.1 |
|---|---|
| MMLU (5) | 64.61 |
| ARC (25) | 62.03 |
| HellaSwag (25) | 82.10 |
| TruthfulQA (0) | 58.26 |
| Winogrande (5) | 73.01 |
Paper Abstract
📚 Citation
If you find the model, data, or code useful, please cite our paper:
@article{xu2024magpie,
title={Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing},
author={Zhangchen Xu and Fengqing Jiang and Luyao Niu and Yuntian Deng and Radha Poovendran and Yejin Choi and Bill Yuchen Lin},
year={2024},
eprint={2406.08464},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Please also cite the creators of preference datasets:
SimPO paper:
@article{meng2024simpo,
title={{SimPO}: Simple preference optimization with a reference-free reward},
author={Meng, Yu and Xia, Mengzhou and Chen, Danqi},
journal={arXiv preprint arXiv:2405.14734},
year={2024}
}
UltraFeedback paper:
@article{cui2023ultrafeedback,
title={{UltraFeedback}: Boosting language models with high-quality feedback},
author={Cui, Ganqu and Yuan, Lifan and Ding, Ning and Yao, Guanming and Zhu, Wei and Ni, Yuan and Xie, Guotong and Liu, Zhiyuan and Sun, Maosong},
journal={arXiv preprint arXiv:2310.01377},
year={2023}
}
ArmoRM paper:
@article{wang2024interpretable,
title={Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts},
author={Wang, Haoxiang and Xiong, Wei and Xie, Tengyang and Zhao, Han and Zhang, Tong},
journal={arXiv preprint arXiv:2406.12845},
year={2024}
}
Questions? Please contact Zhangchen by email.
- Downloads last month
- 14
Model tree for Magpie-Align/Llama-3-8B-Magpie-Align-v0.1
Base model
meta-llama/Meta-Llama-3-8B