🔥 Chat with Magpie Here!

🐦 Llama-3-8B-Magpie-Align-v0.1

Project Web: https://magpie-align.github.io/

Online Model Demo: https://huggingface.co/spaces/flydust/Chat-with-Magpie

Arxiv Technical Report: https://arxiv.org/abs/2406.08464

Codes: https://github.com/magpie-align/magpie

Model Overview

This model is an aligned version of meta-llama/Meta-Llama-3-8B. We apply the following pipeline:

We first use Magpie-Align/Magpie-Pro-MT-300K-v0.1 dataset and perform SFT -> Magpie-Align/Llama-3-8B-Magpie-Align-SFT-v0.1
We then perform DPO on the princeton-nlp/llama3-ultrafeedback dataset.

The overall performance is even better than the official Llama-3-8B-Instruct Model!

Alpaca Eval 2 (vs GPT-4-Turbo-1106): 38.52 (LC), 38.47 (WR)
Alpaca Eval 2 (vs Llama-3-8B-Instruct): 69.37 (LC), 70.05 (WR)
Arena Hard: 32.4
WildBench: 39.3 ((was) Best <30B Model! 🏆)
Zero-Eval GSM: 54.62

Model Performance

We compare our Llama-3-8B-Magpie-Align with official and other open-aligned LLMs that have been fine-tuned from base models and have publicly released their training datasets. The results are as follows:

+---------------------------------------------+--------------------+--------------------+-----------------------+------------+
| Aligned Model ID | MT-Bench | Alpaca Eval 2 | Alpaca Eval 2 | Arena Hard |
| | | (GPT-4-Turbo-1106) | (Llama-3-8B-Instruct) | |
+---------------------------------------------+------+------+------+----------+---------+-----------+-----------+------------+
| | R1 | R2 | AVG | LC WR | WR | LC WR | WR | Score |
+---------------------------------------------+------+------+------+----------+---------+-----------+-----------+------------+
| meta-llama/Meta-Llama-3-8B-Instruct | 8.31 | 7.65 | 7.98 | 22.92 | 22.57 | 50 | 50 | 20.6 |
+---------------------------------------------+------+------+------+----------+---------+-----------+-----------+------------+
| princeton-nlp/Llama-3-Base-8B-SFT-DPO | 8.12 | 7.23 | 7.67 | 17.71 | 15.34 | 43.73 | 38.80 | 14.8 |
+---------------------------------------------+------+------+------+----------+---------+-----------+-----------+------------+
| NousResearch/Hermes-2-Pro-Llama-3-8B | 8.05 | 7.35 | 7.70 | 15.60 | 12.86 | 36.37 | 30.52 | 11.5 |
+---------------------------------------------+------+------+------+----------+---------+-----------+-----------+------------+
| allenai/llama-3-tulu-2-dpo-8b | 7.71 | 7.15 | 7.43 | 14.89 | 14.80 | 35.43 | 35.42 | 11.7 |
+---------------------------------------------+------+------+------+----------+---------+-----------+-----------+------------+
| cognitivecomputations/dolphin-2.9-llama3-8b | 7.97 | 6.98 | 7.47 | 12.50 | 8.79 | 32.67 | 22.80 | 8.2 |
+---------------------------------------------+------+------+------+----------+---------+-----------+-----------+------------+
| openchat/openchat-3.6-8b-20240522 | 7.83 | 7.23 | 7.53 | 17.70 | 12.53 | 41.30 | 30.79 | 6.7 |
+---------------------------------------------+------+------+------+----------+---------+-----------+-----------+------------+
| Magpie-Align/Llama-3-8B-Magpie-Align-v0.1 | 8.01 | 7.63 | 7.82 | 38.52 | 38.47 | 69.37 | 70.05 | 32.4 |
+---------------------------------------------+------+------+------+----------+---------+-----------+-----------+------------+
| Magpie-Align/Llama-3-8B-Magpie-Align-v0.2 | 7.81 | 7.64 | 7.73 | 49.86 | 51.98 | 75.17 | 78.20 | 37.5 |
+---------------------------------------------+------+------+------+----------+---------+-----------+-----------+------------+

👀 Other Information

License: Please follow Meta Llama 3 Community License.

Conversation Template: Please use Llama 3 official chat template for the best performance.

How to use it? Please check the official Llama 3 repository for detailed instructions. Simply replace the original model_id with Magpie-Align/Llama-3-8B-Magpie-Align-v0.1.

The detailed training pipeline is as follows.

Stage 1: Supervised Fine-tuning

We use Axolotl for SFT.

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 2e-05
train_batch_size: 1
eval_batch_size: 1
seed: 42
distributed_type: multi-GPU
num_devices: 4
gradient_accumulation_steps: 8
total_train_batch_size: 32
total_eval_batch_size: 4
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_steps: 100
num_epochs: 2

Training results

Training Loss	Epoch	Step	Validation Loss
0.8807	0.0007	1	0.9001
0.5113	0.3337	464	0.5178
0.4668	0.6673	928	0.4792
0.4492	1.0010	1392	0.4582
0.3498	1.3205	1856	0.4575
0.3525	1.6542	2320	0.4555

Framework versions

Transformers 4.40.2
Pytorch 2.3.0+cu121
Datasets 2.19.1
Tokenizers 0.19.1

👁 Built with Axolotl

Stage 2: Direct Preference Optimization

We use alignment handbook for DPO.

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 1e-06
train_batch_size: 2
eval_batch_size: 4
seed: 42
distributed_type: multi-GPU
num_devices: 4
gradient_accumulation_steps: 16
total_train_batch_size: 128
total_eval_batch_size: 16
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.1
num_epochs: 1

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.628	0.2138	100	0.6641	-0.8806	-1.0146	0.6240	0.1340	-362.7133	-343.6060	-0.7539	-0.7528
0.6935	0.4275	200	0.6352	-1.3660	-1.6311	0.6545	0.2651	-424.3628	-392.1437	-0.6649	-0.6629
0.6376	0.6413	300	0.6178	-1.3533	-1.6413	0.6748	0.2880	-425.3859	-390.8818	-0.6753	-0.6758
0.5888	0.8550	400	0.6088	-1.6321	-1.9785	0.6829	0.3464	-459.1051	-418.7560	-0.6440	-0.6435

It achieves the following results on the evaluation set:

Loss: 0.6084
Rewards/chosen: -1.6265
Rewards/rejected: -1.9735
Rewards/accuracies: 0.6809
Rewards/margins: 0.3470
Logps/rejected: -458.6070
Logps/chosen: -418.2021
Logits/rejected: -0.6447
Logits/chosen: -0.6439

Framework versions

Transformers 4.41.2
Pytorch 2.3.1+cu121
Datasets 2.20.0
Tokenizers 0.19.1

Downstream Performance

Datasets	Llama-3-8B-Magpie-Align-v0.1
MMLU (5)	64.61
ARC (25)	62.03
HellaSwag (25)	82.10
TruthfulQA (0)	58.26
Winogrande (5)	73.01

Paper Abstract

📚 Citation

If you find the model, data, or code useful, please cite our paper:

@article{xu2024magpie,
 title={Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing}, 
 author={Zhangchen Xu and Fengqing Jiang and Luyao Niu and Yuntian Deng and Radha Poovendran and Yejin Choi and Bill Yuchen Lin},
 year={2024},
 eprint={2406.08464},
 archivePrefix={arXiv},
 primaryClass={cs.CL}
}

Please also cite the creators of preference datasets:

SimPO paper:

@article{meng2024simpo,
 title={{SimPO}: Simple preference optimization with a reference-free reward},
 author={Meng, Yu and Xia, Mengzhou and Chen, Danqi},
 journal={arXiv preprint arXiv:2405.14734},
 year={2024}
}

UltraFeedback paper:

@article{cui2023ultrafeedback,
 title={{UltraFeedback}: Boosting language models with high-quality feedback},
 author={Cui, Ganqu and Yuan, Lifan and Ding, Ning and Yao, Guanming and Zhu, Wei and Ni, Yuan and Xie, Guotong and Liu, Zhiyuan and Sun, Maosong},
 journal={arXiv preprint arXiv:2310.01377},
 year={2023}
}

ArmoRM paper:

@article{wang2024interpretable,
 title={Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts},
 author={Wang, Haoxiang and Xiong, Wei and Xie, Tengyang and Zhao, Han and Zhang, Tong},
 journal={arXiv preprint arXiv:2406.12845},
 year={2024}
}

Questions? Please contact Zhangchen by email.