OLMoE Mix (September 2024)

The following data mix was used to train OLMoE-1B-7B, a Mixture-of-Experts LLM with 1B active and 7B total parameters released in September 2024.

The base version of OLMoE-1B-7B can be found at this page, the SFT version is available here, and a version combining SFT and DPO is available at this link.

Statistics

Subset	Tokens	Words	Bytes	Docs
DCLM Baseline 1.0	3.86 T	3.38 T	16.7 T	2.95 B
Starcoder	101 B	63.9 B	325 B	78.7 M
peS2o (Dolma)	57.2 B	51.3 B	268 B	38.8 M
Arxiv (RedPajama v1 via Proof Pile II)	21.1 B	23.5 B	88.8 B	1.55 M
OpenWebMath (Proof Pile II)	12.7 B	10.2 B	42.4 B	2.91 M
Algebraic Stack (Proof Pile II)	12.6 B	9.6 B	39.3 B	2.83 M
En Wikipedia + Wikibooks (Dolma)	3.69 B	3.16 B	16.2 B	6.17 M
Total	4.07 T	3.53 T	17.4 T	3.08 B

Preprocessing

All subsets were pre-processed to remove documents with a sequence of 32 or more repeated ngrams.

an ngram is a span of 1 to 13 tokens, inclusive;
tokens are obtained using the model tokenizer;
a sequence is a contiguous span of repeated ngrams.
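
The rule above can be illustrated with a minimal sketch. It is not the code used to build the mix: the function name `has_repeated_ngram_run` and its parameters are hypothetical, the input is assumed to be a list of token IDs already produced by the model tokenizer, and "32 or more repeated ngrams" is read here as the same ngram occurring at least 32 times back to back.

```python
from typing import Sequence


def has_repeated_ngram_run(
    tokens: Sequence[int],
    max_n: int = 13,        # longest ngram considered (1 to 13 tokens, inclusive)
    min_repeats: int = 32,  # assumed reading: 32 back-to-back copies of one ngram
) -> bool:
    """Return True if some ngram of length 1..max_n repeats contiguously
    at least min_repeats times anywhere in the token sequence."""
    for n in range(1, max_n + 1):
        # A span made of k back-to-back copies of an ngram of length n
        # satisfies tokens[i] == tokens[i + n] for (k - 1) * n consecutive i.
        needed = (min_repeats - 1) * n
        run = 0
        for i in range(len(tokens) - n):
            if tokens[i] == tokens[i + n]:
                run += 1
                if run >= needed:
                    return True
            else:
                run = 0
    return False


# Hypothetical token IDs: the bigram (1, 2) repeated 32 times is flagged.
flagged = has_repeated_ngram_run([1, 2] * 32)
```

Under this reading, a document would be dropped whenever the function returns True. Checking each period length separately keeps the scan linear in the document length for every value of n.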

In addition to the above, the Starcoder subset was further processed by removing any document that meets any of the following rules:

document is from a repository with fewer than 2 stars on GitHub;
the most frequent word in the document constitutes over 30% of the document;
the two most frequent words in the document constitute over 50% of the document.
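
As a rough sketch of how these three rules could be combined, consider the following. The function name `should_drop_code_document`, the `repo_stars` argument, whitespace splitting as the definition of a "word", and measuring each share by word count are all assumptions made for illustration; the source does not specify them.

```python
from collections import Counter


def should_drop_code_document(
    text: str,
    repo_stars: int,               # hypothetical metadata field: GitHub stars of the source repository
    min_stars: int = 2,
    top1_max_share: float = 0.30,
    top2_max_share: float = 0.50,
) -> bool:
    """Return True if the document meets any of the three removal rules."""
    # Rule 1: repository has fewer than 2 stars.
    if repo_stars < min_stars:
        return True

    # Assumption: words are whitespace-separated and shares are by word count.
    words = text.split()
    if not words:
        return False

    top_two = Counter(words).most_common(2)
    top1_share = top_two[0][1] / len(words)
    top2_share = sum(count for _, count in top_two) / len(words)

    # Rule 2: the most frequent word is over 30% of the document.
    # Rule 3: the two most frequent words together are over 50% of the document.
    return top1_share > top1_max_share or top2_share > top2_max_share


# Example: "x" makes up 4 of 10 words, so the document is dropped.
dropped = should_drop_code_document("x x x x y z w v u t", repo_stars=5)
```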

Licensing Information

This mix is licensed under the Open Data Commons Attribution License (ODC-By) v1.0. By using this dataset, you are bound by the licenses and Terms of Service of the underlying datasets, which you can access by clicking on the links in the table above.

Citation

@misc{muennighoff2024olmoeopenmixtureofexpertslanguage,
 title={OLMoE: Open Mixture-of-Experts Language Models}, 
 author={Niklas Muennighoff and Luca Soldaini and Dirk Groeneveld and Kyle Lo and Jacob Morrison and Sewon Min and Weijia Shi and Pete Walsh and Oyvind Tafjord and Nathan Lambert and Yuling Gu and Shane Arora and Akshita Bhagia and Dustin Schwenk and David Wadden and Alexander Wettig and Binyuan Hui and Tim Dettmers and Douwe Kiela and Ali Farhadi and Noah A. Smith and Pang Wei Koh and Amanpreet Singh and Hannaneh Hajishirzi},
 year={2024},
 eprint={2409.02060},
 archivePrefix={arXiv},
 primaryClass={cs.CL},
 url={https://arxiv.org/abs/2409.02060}, 
}