Ettin Pre-training Data
License: MIT
Phase 1 of 3: Diverse pre-training data mixture (1.7T tokens) used to train the Ettin model suite.
This dataset contains the pre-training phase data used to train all Ettin encoder and decoder models. The data is provided in MDS format ready for use with Composer and the ModernBERT training repository.
Data Composition
| Data Source | Tokens (B) | Percentage | Description |
|---|---|---|---|
| DCLM | 837.2 | 49.1% | High-quality web crawl data |
| CC Head | 356.6 | 20.9% | Common Crawl head documents |
| Starcoder | 263.9 | 15.5% | Code repositories and files |
| Reddit | 80.3 | 4.7% | Social discussion threads |
| PeS2o | 57.3 | 3.4% | Scientific papers |
| Arxiv | 28.0 | 1.6% | Academic preprints |
| StackExchange | 19.6 | 1.2% | Q&A forums |
| Tulu Flan | 16.6 | 1.0% | Instruction-following data |
| Open-Web-Math | 12.7 | 0.7% | Mathematical content |
| Algebraic StackExchange | 12.6 | 0.7% | Math Q&A |
| CC News | 7.3 | 0.4% | News articles |
| Wikipedia | 7.3 | 0.4% | Encyclopedia articles |
| Total | 1,704.7 | 100.0% | Diverse mixture for foundation training |
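The percentage column can be reproduced by dividing each source's token count by the stated total. The following is a minimal sketch of that arithmetic, not part of the released tooling; the names `TOKENS_B`, `STATED_TOTAL_B`, and `shares` are hypothetical, and the label "Reddit" is assumed for the 80.3B social-discussion row based on the `reddit/` folder. The listed rows sum to roughly 1,699.4B, so the sketch divides by the stated total of 1,704.7B rather than by the row sum.

```python
# Token counts in billions, copied from the table above.
# "Reddit" is an assumed label for the 80.3B social-discussion row.
TOKENS_B = {
    "DCLM": 837.2,
    "CC Head": 356.6,
    "Starcoder": 263.9,
    "Reddit": 80.3,
    "PeS2o": 57.3,
    "Arxiv": 28.0,
    "StackExchange": 19.6,
    "Tulu Flan": 16.6,
    "Open-Web-Math": 12.7,
    "Algebraic StackExchange": 12.6,
    "CC News": 7.3,
    "Wikipedia": 7.3,
}

# Value of the "Total" row in the table.
STATED_TOTAL_B = 1704.7


def shares(tokens_b, total_b):
    """Return each source's share of total_b as a percentage."""
    return {name: 100.0 * count / total_b for name, count in tokens_b.items()}


if __name__ == "__main__":
    for name, pct in shares(TOKENS_B, STATED_TOTAL_B).items():
        print(f"{name:<25}{pct:5.1f}%")
```

Because the published token counts are themselves rounded, recomputed shares can differ from the table in the last digit.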
Usage
For pre-training, see the ModernBERT repo: https://github.com/AnswerDotAI/ModernBERT
Direct Access
from streaming import StreamingDataset

# Load the streaming dataset
dataset = StreamingDataset(
    remote='https://huggingface.co/datasets/jhu-clsp/ettin-pretraining-data',
    local='/tmp/ettin-pretraining-data',
    shuffle=True
)

# Access samples
for sample in dataset:
    text = sample['text']
    # Process your data...
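
When exploring a corpus of this size, it is usually enough to look at a handful of samples instead of iterating over everything. The sketch below shows one way to do that with `itertools.islice`. It assumes, as in the loop above, that each sample is a dict with a `text` key; `peek_texts` is a hypothetical helper, and the list used in the demo is a stand-in for the real dataset so the snippet runs without downloading anything.

```python
from itertools import islice


def peek_texts(dataset, n=3, max_chars=200):
    """Return the truncated 'text' field of the first n samples.

    Works on any iterable of dict-like samples, so it stops after n
    items instead of walking the whole dataset.
    """
    return [sample["text"][:max_chars] for sample in islice(dataset, n)]


if __name__ == "__main__":
    # Stand-in for a streaming dataset: any iterable of {'text': ...} dicts.
    fake_dataset = [
        {"text": "first document"},
        {"text": "second document"},
        {"text": "third document"},
    ]
    for preview in peek_texts(fake_dataset, n=2):
        print(preview)
```

The same call should work on the `dataset` object from the snippet above, provided its samples expose the `text` field.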
Structure
Each folder contains one data source in MDS (Mosaic Data Shard) format:
- arxiv/ - Academic papers from arXiv
- books/ - Literature and reference books
- cc_head/ - High-quality Common Crawl documents
- cc_news/ - News articles from Common Crawl
- dclm/ - DataComp-LM filtered web data
- open_web_math/ - Mathematical web content
- algebraic_stackexchange/ - Math Q&A from StackExchange
- pes2o/ - Scientific papers (PeS2o dataset)
- reddit/ - Reddit discussion threads
- stackexchange/ - General StackExchange Q&A
- starcoder/ - Code from GitHub repositories
- tulu_flan/ - Instruction-following examples
- wikipedia/ - Wikipedia articles
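
Since each source lives in its own folder, it may be convenient to address one source at a time. The sketch below only builds the per-source remote and local paths. It assumes, without confirmation from this card, that each folder can be used as a standalone MDS remote; `SOURCES`, `source_remote`, and `source_local` are hypothetical names, and the base URL and cache directory are taken from the usage snippet above.

```python
BASE_REMOTE = "https://huggingface.co/datasets/jhu-clsp/ettin-pretraining-data"
BASE_LOCAL = "/tmp/ettin-pretraining-data"

# Folder names as listed in the structure above.
SOURCES = (
    "arxiv", "books", "cc_head", "cc_news", "dclm", "open_web_math",
    "algebraic_stackexchange", "pes2o", "reddit", "stackexchange",
    "starcoder", "tulu_flan", "wikipedia",
)


def _check(source):
    """Reject folder names that are not part of the listed structure."""
    if source not in SOURCES:
        raise ValueError(f"unknown source folder: {source!r}")


def source_remote(source, base=BASE_REMOTE):
    """Return the assumed remote path of a single source folder."""
    _check(source)
    return f"{base.rstrip('/')}/{source}"


def source_local(source, base=BASE_LOCAL):
    """Return a matching local cache directory for that source."""
    _check(source)
    return f"{base.rstrip('/')}/{source}"


if __name__ == "__main__":
    print(source_remote("arxiv"))
    print(source_local("arxiv"))
```

If the assumption holds, these two strings could be passed as the `remote` and `local` arguments of `StreamingDataset` in place of the repository-level paths.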
Related Resources
- Models: Ettin Model Suite (17M-1B parameters)
- Phase 2: Mid-training Data (250B tokens)
- Phase 3: Decay Phase Data (50B tokens)
- Training Order: Batch-level Data Order
- Paper: https://arxiv.org/abs/2507.11412
- Code: GitHub Repository
Citation
@misc{weller2025seqvsseqopen,
  title={Seq vs Seq: An Open Suite of Paired Encoders and Decoders},
  author={Orion Weller and Kathryn Ricci and Marc Marone and Antoine Chaffin and Dawn Lawrie and Benjamin Van Durme},
  year={2025},
  eprint={2507.11412},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.11412},
}
