Ettin Pre-training Data

License: MIT

Phase 1 of 3: Diverse pre-training data mixture (1.7T tokens) used to train the Ettin model suite.

This dataset contains the pre-training phase data used to train all Ettin encoder and decoder models. The data is provided in MDS format, ready for use with Composer and the ModernBERT training repository.

📊 Data Composition

Data Source	Tokens (B)	Percentage	Description
DCLM	837.2	49.1%	High-quality web crawl data
CC Head	356.6	20.9%	Common Crawl head documents
Starcoder	263.9	15.5%	Code repositories and files
Reddit	80.3	4.7%	Social discussion threads
PeS2o	57.3	3.4%	Scientific papers
Arxiv	28.0	1.6%	Academic preprints
StackExchange	19.6	1.2%	Q&A forums
Tulu Flan	16.6	1.0%	Instruction-following data
Open-Web-Math	12.7	0.7%	Mathematical content
Algebraic StackExchange	12.6	0.7%	Math Q&A
CC News	7.3	0.4%	News articles
Wikipedia	7.3	0.4%	Encyclopedia articles
Total	1,704.7	100.0%	Diverse mixture for foundation training
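
The percentage column follows from the token counts. As a quick sanity check, the sketch below recomputes each source's share against the stated total of 1,704.7B tokens, using values copied from the table above. The names `TOKENS_B`, `STATED_TOTAL_B`, and `shares` are illustrative only. Because the published counts are rounded to 0.1B, a recomputed share can differ from the table by about 0.1 percentage point, and the listed rows need not sum exactly to the stated total.

```python
# Token counts in billions, copied from the Data Composition table.
TOKENS_B = {
    "DCLM": 837.2,
    "CC Head": 356.6,
    "Starcoder": 263.9,
    "Reddit": 80.3,
    "PeS2o": 57.3,
    "Arxiv": 28.0,
    "StackExchange": 19.6,
    "Tulu Flan": 16.6,
    "Open-Web-Math": 12.7,
    "Algebraic StackExchange": 12.6,
    "CC News": 7.3,
    "Wikipedia": 7.3,
}
STATED_TOTAL_B = 1704.7  # "Total" row of the table


def shares(tokens, total):
    """Return each source's share of `total`, in percent."""
    return {name: 100.0 * count / total for name, count in tokens.items()}


listed_sum = sum(TOKENS_B.values())
pct = shares(TOKENS_B, STATED_TOTAL_B)

# Print sources from largest to smallest share.
for name, value in sorted(pct.items(), key=lambda kv: -kv[1]):
    print(f"{name:<24}{TOKENS_B[name]:>8.1f}B {value:>5.1f}%")
print(f"sum of listed rows: {listed_sum:.1f}B (stated total: {STATED_TOTAL_B:.1f}B)")
```

The same shares could serve as sampling proportions when mixing sources, although the card does not state how the sources were sampled during training.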

🚀 Usage

For pre-training, see the ModernBERT repo: https://github.com/AnswerDotAI/ModernBERT

Direct Access

from streaming import StreamingDataset

# Load the streaming dataset
dataset = StreamingDataset(
    remote='https://huggingface.co/datasets/jhu-clsp/ettin-pretraining-data',
    local='/tmp/ettin-pretraining-data',
    shuffle=True,
)

# Access samples
for sample in dataset:
    text = sample['text']
    # Process your data...
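
Iterating over the full 1.7T-token mixture is rarely what is wanted when first exploring the data. A minimal sketch for peeking at the first few samples follows. It works on any iterable of dict-like samples with a `text` field, as in the loop above. The helper name `peek_texts` and the stand-in `fake_dataset` list are hypothetical and exist only so the example runs without downloading anything.

```python
from itertools import islice


def peek_texts(dataset, n=3, max_chars=80):
    """Return the truncated `text` field of the first `n` samples."""
    return [sample["text"][:max_chars] for sample in islice(dataset, n)]


# Stand-in for a StreamingDataset: any iterable of {'text': ...} dicts works.
fake_dataset = [{"text": f"document number {i} " * 10} for i in range(100)]

for preview in peek_texts(fake_dataset, n=3, max_chars=40):
    print(repr(preview))
```

With real data, pass the `StreamingDataset` instance in place of `fake_dataset`; `islice` stops after `n` samples, so the loop does not walk the whole dataset.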

📁 Structure

Each folder contains one data source in MDS (Mosaic Data Shard) format:

arxiv/ - Academic papers from arXiv
books/ - Literature and reference books
cc_head/ - High-quality Common Crawl documents
cc_news/ - News articles from Common Crawl
dclm/ - DataComp-LM filtered web data
open_web_math/ - Mathematical web content
algebraic_stackexchange/ - Math Q&A from StackExchange
pes2o/ - Scientific papers (PeS2o dataset)
reddit/ - Reddit discussion threads
stackexchange/ - General StackExchange Q&A
starcoder/ - Code from GitHub repositories
tulu_flan/ - Instruction-following examples
wikipedia/ - Wikipedia articles
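
To check what a downloaded source folder holds before training on it, one option is to read the folder's `index.json`. The sketch below assumes the layout written by the `streaming` library's `MDSWriter`: an `index.json` with a top-level `shards` list whose entries carry a `samples` count. That layout is an assumption about this repository rather than something the card states, and `summarize_mds_index` is a hypothetical helper name. The example writes a tiny fake index so it can run without downloading any data.

```python
import json
import tempfile
from pathlib import Path


def summarize_mds_index(folder):
    """Count shards and samples listed in `<folder>/index.json`.

    Assumes the MDS index layout: {"shards": [{"samples": int, ...}, ...]}.
    """
    index = json.loads((Path(folder) / "index.json").read_text())
    shards = index.get("shards", [])
    return {
        "num_shards": len(shards),
        "num_samples": sum(shard.get("samples", 0) for shard in shards),
    }


# Demo on a fake folder; a real one would be a local copy of e.g. wikipedia/.
with tempfile.TemporaryDirectory() as tmp:
    fake_folder = Path(tmp) / "wikipedia"
    fake_folder.mkdir()
    fake_index = {"version": 2, "shards": [{"samples": 5}, {"samples": 7}]}
    (fake_folder / "index.json").write_text(json.dumps(fake_index))
    summary = summarize_mds_index(fake_folder)

print(summary)
```

A single folder can likely be fetched on its own with `huggingface_hub.snapshot_download` by passing `repo_type="dataset"` and an `allow_patterns` filter such as `["wikipedia/*"]`; whether `index.json` sits directly inside each folder or in a nested split directory is not stated here and should be checked after downloading.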

🔗 Related Resources

Models: Ettin Model Suite (17M-1B parameters)
Phase 2: Mid-training Data (250B tokens)
Phase 3: Decay Phase Data (50B tokens)
Training Order: Batch-level Data Order
Paper: https://arxiv.org/abs/2507.11412
Code: GitHub Repository

Citation

@misc{weller2025seqvsseqopen,
 title={Seq vs Seq: An Open Suite of Paired Encoders and Decoders}, 
 author={Orion Weller and Kathryn Ricci and Marc Marone and Antoine Chaffin and Dawn Lawrie and Benjamin Van Durme},
 year={2025},
 eprint={2507.11412},
 archivePrefix={arXiv},
 primaryClass={cs.CL},
 url={https://arxiv.org/abs/2507.11412}, 
}
