
URL: https://huggingface.co/instruction-pretrain/instruction-synthesizer



Instruction Pre-Training: Language Models are Supervised Multitask Learners (EMNLP 2024)

This repo contains the context-based instruction synthesizer in our paper Instruction Pre-Training: Language Models are Supervised Multitask Learners.

We explore supervised multitask pre-training by proposing Instruction Pre-Training, a framework that scalably augments massive raw corpora with instruction-response pairs to pre-train language models. The instruction-response pairs are generated by an efficient instruction synthesizer built on open-source models. In our experiments, we synthesize 200M instruction-response pairs covering 40+ task categories to verify the effectiveness of Instruction Pre-Training. Instruction Pre-Training outperforms Vanilla Pre-Training in both general pre-training from scratch and domain-adaptive continual pre-training. In pre-training from scratch, Instruction Pre-Training not only improves pre-trained base models but also benefits more from further instruction tuning. In continual pre-training, Instruction Pre-Training enables Llama3-8B to be comparable to or even outperform Llama3-70B.


Resources

🤗 We share our data and models with example usages. Feel free to open a discussion on this page! 🤗

Synthesize Instruction-Response Pairs to Augment Any Raw Corpora

We conduct multitask fine-tuning on a language model to develop an instruction synthesizer capable of generating instruction-response pairs from any raw text. The fine-tuning data are available at ft-instruction-synthesizer-collection.


1. Basic Usage: Synthesize instruction-response pairs based on a given raw text

💗 Here is an amazing demo that implements our approach: davanstrien/instruction-synthesizer 💗

2. Advanced Usage: Convert Raw Corpora into Instruction-Augmented Corpora at Scale

We use vLLM to accelerate the synthesis process. On a single A100-80GB GPU, it takes about 1 day to synthesize instruction-response pairs for 1 billion tokens of raw corpora.
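
As a rough planning aid, the throughput figure above can be turned into a back-of-the-envelope estimate. The sketch below assumes that synthesis time scales linearly with corpus size and inversely with GPU count, which is an assumption and not something the reported figure guarantees. The names `estimate_synthesis_days` and `implied_tokens_per_second` are hypothetical.

```python
# Reported figure: about 1 day per 1 billion raw-corpus tokens
# on a single A100-80GB GPU with vLLM.
TOKENS_PER_GPU_DAY = 1_000_000_000


def estimate_synthesis_days(num_tokens: int, num_gpus: int = 1) -> float:
    """Estimate wall-clock days for synthesis.

    ASSUMPTION: linear scaling in corpus size and GPU count.
    Real throughput depends on text length, batching, and hardware.
    """
    if num_tokens < 0 or num_gpus < 1:
        raise ValueError("num_tokens must be >= 0 and num_gpus must be >= 1")
    return num_tokens / (TOKENS_PER_GPU_DAY * num_gpus)


def implied_tokens_per_second() -> float:
    """Raw-corpus tokens processed per second implied by the reported figure."""
    return TOKENS_PER_GPU_DAY / (24 * 60 * 60)
```

For example, `estimate_synthesis_days(4_000_000_000, num_gpus=8)` gives half a day under the linear-scaling assumption.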

Pre-Training Suggestions:

Except for the pre-training data, Instruction Pre-Training keeps all other settings the same as Vanilla Pre-Training.

Therefore, you can easily use any training framework, such as OLMo (for pre-training from scratch) and LLaMA-Factory (for continual pre-training), to train on the templated instruction-augmented corpora.

  1. For general pre-training from scratch, we recommend setting M = 2 and mixing the instruction-augmented corpora with unchanged raw corpora.
  2. For domain-adaptive continual pre-training, we recommend setting M = 3 and mixing the instruction-augmented corpora with general instructions from OpenOrca at a 1:1 ratio (counted by tokens). Each example from OpenOrca is formulated as "{question} {response}", with a single whitespace connecting the question and response.
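
The 1:1 token-level mix in the second recommendation could be approximated as in the sketch below. This is not the authors' pipeline: `count_tokens` is a hypothetical stand-in (whitespace splitting) for a real tokenizer, and `mix_one_to_one`, `format_general_instruction`, and the `question`/`response` dict keys are illustrative names.

```python
import random
from typing import Callable, Dict, Iterable, List


def count_tokens(text: str) -> int:
    """Hypothetical token counter (whitespace split); swap in a real tokenizer."""
    return len(text.split())


def format_general_instruction(example: Dict[str, str]) -> str:
    """Join question and response with a single whitespace."""
    return f"{example['question']} {example['response']}"


def mix_one_to_one(
    augmented_texts: List[str],
    general_examples: Iterable[Dict[str, str]],
    counter: Callable[[str], int] = count_tokens,
    seed: int = 0,
) -> List[str]:
    """Add general-instruction texts until their token total reaches the
    token total of the instruction-augmented corpus, then shuffle."""
    budget = sum(counter(text) for text in augmented_texts)
    general_texts: List[str] = []
    used = 0
    for example in general_examples:
        if used >= budget:
            break
        text = format_general_instruction(example)
        general_texts.append(text)
        used += counter(text)
    mixed = list(augmented_texts) + general_texts
    random.Random(seed).shuffle(mixed)
    return mixed
```

The last general example added may overshoot the budget slightly, so the resulting ratio is approximately, not exactly, 1:1.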

For a quick start, try our method in continual pre-training; it is easy to set up.

Feel free to ask for suggestions on this page; we will reply as soon as possible 🤗!

FAQ on Continual Pre-Training from Llama3

Q1: Do you use the official Llama3 instruction prompt for pre-training?

No. The provided Llama3 instruction prompt is designed for the instruction-tuned model, whereas our continual pre-training is conducted on the pre-trained base model, where only the BOS (<|begin_of_text|>) and EOS (<|end_of_text|>) tokens are required.

Q2: For the general instructions from OpenOrca, do you concatenate each instruction with its output using '\n'?

No. As mentioned in the pre-training suggestions, we use a single whitespace to concatenate each question with its response for the general instruction data from OpenOrca. This is because OpenOrca's data is already templated with diverse natural language templates (such as those with \n), so a whitespace is sufficient to formulate the data.

Note that when using our templated instruction-augmented texts, you don't need to add any extra concatenation.

Q3: What about those system prompts in OpenOrca?

We simply discard the system prompts.

To put it all together, the text before tokenization looks like this:

general_instruction_response_text = "<|begin_of_text|>{question} {response}<|end_of_text|>"

instruction_augmented_text = "<|begin_of_text|>{instruction augmented text}<|end_of_text|>"

Then, for tokenization, you don't need to add BOS and EOS token ids. The tokenization code looks like this:

text_ids = tokenizer(text, add_special_tokens=False, **kwargs).input_ids
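
The answers to Q1 through Q3 can be combined into two small helpers that build the pre-tokenization strings shown above. This is a minimal sketch: the function names are hypothetical, and the `system_prompt`, `question`, and `response` keys are assumed to match the fields of your OpenOrca records.

```python
# Llama3 base-model special tokens, written into the text as literal strings.
BOS = "<|begin_of_text|>"
EOS = "<|end_of_text|>"


def build_general_instruction_text(example: dict) -> str:
    """Format one OpenOrca-style example.

    Any 'system_prompt' field is deliberately ignored (discarded), and the
    question and response are joined with a single whitespace.
    """
    return f"{BOS}{example['question']} {example['response']}{EOS}"


def build_instruction_augmented_text(templated_text: str) -> str:
    """Wrap an already-templated instruction-augmented text; no extra
    concatenation is added inside the text."""
    return f"{BOS}{templated_text}{EOS}"
```

Because BOS and EOS are already part of the string, the resulting text should be tokenized with `add_special_tokens=False`, as in the tokenization line above, so that the special tokens are not added a second time.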

Citation

If you find our work helpful, please cite us:

Instruction Pre-Training (EMNLP 2024)

@article{cheng2024instruction,
 title={Instruction Pre-Training: Language Models are Supervised Multitask Learners},
 author={Cheng, Daixuan and Gu, Yuxian and Huang, Shaohan and Bi, Junyu and Huang, Minlie and Wei, Furu},
 journal={arXiv preprint arXiv:2406.14491},
 year={2024}
}

Adapt LLM to Domains (ICLR 2024)

@inproceedings{cheng2024adapting,
 title={Adapting Large Language Models via Reading Comprehension},
 author={Cheng, Daixuan and Huang, Shaohan and Wei, Furu},
 booktitle={The Twelfth International Conference on Learning Representations},
 year={2024},
 url={https://openreview.net/forum?id=y886UXPEZ0}
}