URL: https://huggingface.co/datasets/allenai/OLMoE-mix-0924

allenai/OLMoE-mix-0924 · Datasets at Hugging Face


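Dataset Preview

Columns: id (string, e.g. proofpile-arXiv_065-0), text (string), added (string, timestamp such as 2024-02-18T23:39:39.769Z), created (string, timestamp such as 2022-07-05T02:23:46.000Z).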

OLMoE Mix (September 2024)

The following data mix was used to train OLMoE-1B-7B, a Mixture-of-Experts LLM with 1B active and 7B total parameters released in September 2024.

The base version of OLMoE-1B-7B can be found at this page, the SFT of OLMoE-1B-7B is available here, and a version combining SFT and DPO is available following this link.

Statistics

Subset                                    Tokens    Words     Bytes     Docs
DCLM Baseline 1.0                         3.86 T    3.38 T    16.7 T    2.95 B
Starcoder                                 101 B     63.9 B    325 B     78.7 M
peS2o (Dolma)                             57.2 B    51.3 B    268 B     38.8 M
Arxiv (RedPajama v1 via Proof Pile II)    21.1 B    23.5 B    88.8 B    1.55 M
OpenWebMath (Proof Pile II)               12.7 B    10.2 B    42.4 B    2.91 M
Algebraic Stack (Proof Pile II)           12.6 B    9.6 B     39.3 B    2.83 M
En Wikipedia + Wikibooks (Dolma)          3.69 B    3.16 B    16.2 B    6.17 M
Total                                     4.07 T    3.53 T    17.4 T    3.08 B
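
As a quick consistency check, the per-subset token counts in the table can be summed and compared against the Total row. The figures below are copied from the table; the variable names are illustrative only.

```python
# Token counts per subset, in billions, as listed in the Statistics table.
tokens_billion = {
    "DCLM Baseline 1.0": 3860.0,
    "Starcoder": 101.0,
    "peS2o": 57.2,
    "Arxiv": 21.1,
    "OpenWebMath": 12.7,
    "Algebraic Stack": 12.6,
    "En Wikipedia + Wikibooks": 3.69,
}

total_billion = sum(tokens_billion.values())
total_trillion = total_billion / 1000.0

# Share of the mix contributed by each subset, largest first.
shares = {
    name: count / total_billion
    for name, count in sorted(tokens_billion.items(), key=lambda kv: -kv[1])
}

print(f"Total tokens: {total_trillion:.2f} T")  # matches the 4.07 T in the Total row
for name, share in shares.items():
    print(f"{name:28s} {share:6.2%}")
```

The same sum for the Words and Bytes columns lands within rounding distance of the listed totals, since each row is reported to three significant figures.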

Preprocessing

All subsets were pre-processed to remove documents containing a sequence of 32 or more repeated n-grams, where:

  • an n-gram is a span of 1 to 13 tokens, inclusive;
  • tokens are obtained using the model tokenizer;
  • a sequence is a contiguous span of repeated n-grams.
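
The rule above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code, and it rests on two assumptions the source does not confirm: "repeated" is read as the same n-gram occurring back to back, and whitespace splitting stands in for the model tokenizer. The function name `has_repeated_ngram_run` and its parameters are hypothetical.

```python
from typing import Hashable, Sequence


def has_repeated_ngram_run(
    tokens: Sequence[Hashable],
    min_repeats: int = 32,
    max_n: int = 13,
) -> bool:
    """Return True if some n-gram (1 <= n <= max_n) occurs at least
    `min_repeats` times in a row, back to back, anywhere in `tokens`."""
    for n in range(1, max_n + 1):
        # A span made of k consecutive copies of an n-gram satisfies
        # tokens[j] == tokens[j + n] for (k - 1) * n consecutive positions j.
        needed = (min_repeats - 1) * n
        run = 0
        for j in range(len(tokens) - n):
            if tokens[j] == tokens[j + n]:
                run += 1
                if run >= needed:
                    return True
            else:
                run = 0
    return False


if __name__ == "__main__":
    # Assumption: whitespace tokens instead of the model tokenizer.
    spam = ("buy now " * 32).split()
    prose = "the quick brown fox jumps over the lazy dog".split()
    print(has_repeated_ngram_run(spam))   # True: the 2-gram repeats 32 times
    print(has_repeated_ngram_run(prose))  # False
```

Comparing each token with the one `n` positions later keeps the check linear in document length for each n-gram size, so no n-grams need to be materialized.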

In addition to the above, the Starcoder dataset was further processed by removing any document that meets any of the following rules:

  • document is from a repository with fewer than 2 stars on GitHub;
  • the most frequent word in the document constitutes over 30% of the document;
  • the two most frequent words in the document constitute over 50% of the document.
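
The three Starcoder rules can be expressed as a single predicate. The sketch below assumes that words are whitespace-separated and that "constitutes over 30% of the document" is measured as a fraction of the word count; the source specifies neither. The function name `should_drop_starcoder_doc` and the `repo_stars` argument are hypothetical.

```python
from collections import Counter


def should_drop_starcoder_doc(text: str, repo_stars: int) -> bool:
    """Return True if the document matches any of the three removal rules."""
    # Rule 1: the repository has fewer than 2 GitHub stars.
    if repo_stars < 2:
        return True

    # Assumption: words are whitespace-separated tokens.
    words = text.split()
    if not words:
        return False  # Empty documents are left to other filters.

    top_two = Counter(words).most_common(2)
    total = len(words)
    top1_share = top_two[0][1] / total
    top2_share = sum(count for _, count in top_two) / total

    # Rule 2: the most frequent word exceeds 30% of the words.
    # Rule 3: the two most frequent words together exceed 50% of the words.
    return top1_share > 0.30 or top2_share > 0.50


if __name__ == "__main__":
    varied = "def add(a, b): return a + b  # sum of two numbers here"
    padded = "0 0 0 0 0 0 0 0 1 2"
    print(should_drop_starcoder_doc(varied, repo_stars=10))  # False
    print(should_drop_starcoder_doc(varied, repo_stars=1))   # True: too few stars
    print(should_drop_starcoder_doc(padded, repo_stars=10))  # True: "0" is 80% of words
```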

Licensing Information

This mix is licensed under the Open Data Commons Attribution License (ODC-By) v1.0. By using this dataset, you are also bound by the licenses and Terms of Service of the underlying datasets, which you can access by clicking on the links in the table above.

Citation

@misc{muennighoff2024olmoeopenmixtureofexpertslanguage,
 title={OLMoE: Open Mixture-of-Experts Language Models}, 
 author={Niklas Muennighoff and Luca Soldaini and Dirk Groeneveld and Kyle Lo and Jacob Morrison and Sewon Min and Weijia Shi and Pete Walsh and Oyvind Tafjord and Nathan Lambert and Yuling Gu and Shane Arora and Akshita Bhagia and Dustin Schwenk and David Wadden and Alexander Wettig and Binyuan Hui and Tim Dettmers and Douwe Kiela and Ali Farhadi and Noah A. Smith and Pang Wei Koh and Amanpreet Singh and Hannaneh Hajishirzi},
 year={2024},
 eprint={2409.02060},
 archivePrefix={arXiv},
 primaryClass={cs.CL},
 url={https://arxiv.org/abs/2409.02060}, 
}