
URL: https://huggingface.co/datasets/jhu-clsp/ettin-pretraining-data


Ettin Pre-training Data

License: MIT

Phase 1 of 3: Diverse pre-training data mixture (1.7T tokens) used to train the Ettin model suite.

This dataset contains the pre-training phase data used to train all Ettin encoder and decoder models. The data is provided in MDS format, ready for use with Composer and the ModernBERT training repository.

📊 Data Composition

| Data Source | Tokens (B) | Percentage | Description |
|---|---|---|---|
| DCLM | 837.2 | 49.1% | High-quality web crawl data |
| CC Head | 356.6 | 20.9% | Common Crawl head documents |
| Starcoder | 263.9 | 15.5% | Code repositories and files |
| Reddit | 80.3 | 4.7% | Social discussion threads |
| PeS2o | 57.3 | 3.4% | Scientific papers |
| Arxiv | 28.0 | 1.6% | Academic preprints |
| StackExchange | 19.6 | 1.2% | Q&A forums |
| Tulu Flan | 16.6 | 1.0% | Instruction-following data |
| Open-Web-Math | 12.7 | 0.7% | Mathematical content |
| Algebraic StackExchange | 12.6 | 0.7% | Math Q&A |
| CC News | 7.3 | 0.4% | News articles |
| Wikipedia | 7.3 | 0.4% | Encyclopedia articles |
| Total | 1,704.7 | 100.0% | Diverse mixture for foundation training |
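The percentage column can be recomputed from the token counts. The sketch below is illustrative only: the names `TOKENS_B`, `STATED_TOTAL_B`, and `token_shares` are hypothetical and not part of the dataset or any library. It divides by the stated total rather than the row sum, since the listed rows add up to roughly 1,699.4B, slightly under the stated 1,704.7B. Recomputed values can differ from the table in the last digit because the table's inputs are themselves rounded.

```python
# Token counts in billions, copied from the data composition table.
TOKENS_B = {
    "DCLM": 837.2,
    "CC Head": 356.6,
    "Starcoder": 263.9,
    "Reddit": 80.3,
    "PeS2o": 57.3,
    "Arxiv": 28.0,
    "StackExchange": 19.6,
    "Tulu Flan": 16.6,
    "Open-Web-Math": 12.7,
    "Algebraic StackExchange": 12.6,
    "CC News": 7.3,
    "Wikipedia": 7.3,
}

# Total as stated in the table (not the sum of the rows above).
STATED_TOTAL_B = 1704.7


def token_shares(tokens_b, total_b):
    """Return each source's share of total_b as a percentage."""
    return {name: 100.0 * count / total_b for name, count in tokens_b.items()}


shares = token_shares(TOKENS_B, STATED_TOTAL_B)

# Print sources from largest to smallest share.
for name, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:<26}{TOKENS_B[name]:>8.1f}B{pct:>7.1f}%")
```

The same shares, divided by 100, could serve as a starting point for per-source sampling weights if you rebuild a similar mixture from the individual folders.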

🚀 Usage

For pre-training, see the ModernBERT repo: https://github.com/AnswerDotAI/ModernBERT

Direct Access

from streaming import StreamingDataset

# Load the streaming dataset
dataset = StreamingDataset(
    remote='https://huggingface.co/datasets/jhu-clsp/ettin-pretraining-data',
    local='/tmp/ettin-pretraining-data',
    shuffle=True
)

# Access samples
for sample in dataset:
    text = sample['text']
    # Process your data...
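Because the full mixture is about 1.7T tokens, it can help to look at only a handful of samples before committing to a full pass. The helper below is a minimal sketch, not part of the streaming library: `preview_texts` is a hypothetical name, and it assumes each sample is a mapping with a `'text'` field, as in the loop above. It works on any iterable of such samples.

```python
from itertools import islice


def preview_texts(dataset, n=3, max_chars=200):
    """Return the truncated 'text' field of the first n samples.

    islice stops after n items, so the rest of the dataset is never read.
    """
    return [sample["text"][:max_chars] for sample in islice(dataset, n)]


# Stand-in data for illustration; in practice, pass the StreamingDataset.
fake_dataset = [{"text": "a" * 500}, {"text": "short doc"}, {"text": "third"}]
for snippet in preview_texts(fake_dataset, n=2, max_chars=40):
    print(repr(snippet))
```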

📁 Structure

Each folder contains one data source in MDS (Mosaic Data Shard) format:

  • arxiv/ - Academic papers from ArXiv
  • books/ - Literature and reference books
  • cc_head/ - High-quality Common Crawl documents
  • cc_news/ - News articles from Common Crawl
  • dclm/ - DataComp-LM filtered web data
  • open_web_math/ - Mathematical web content
  • algebraic_stackexchange/ - Math Q&A from StackExchange
  • pes2o/ - Scientific papers (PeS2o dataset)
  • reddit/ - Reddit discussion threads
  • stackexchange/ - General StackExchange Q&A
  • starcoder/ - Code from GitHub repositories
  • tulu_flan/ - Instruction-following examples
  • wikipedia/ - Wikipedia articles
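Since each source lives in its own folder, one way to work with a subset is to build one stream per folder. The following is a sketch under stated assumptions: it assumes a source can be addressed by appending its folder name to the dataset URL, which this card does not confirm, and the helper names `source_paths` and `build_dataset` are hypothetical. `Stream` and `StreamingDataset` come from the `streaming` package and are imported lazily, so the path helper can be used without it installed.

```python
BASE_REMOTE = "https://huggingface.co/datasets/jhu-clsp/ettin-pretraining-data"
LOCAL_ROOT = "/tmp/ettin-pretraining-data"

# Folder names as listed in the structure section.
SOURCE_FOLDERS = [
    "arxiv", "books", "cc_head", "cc_news", "dclm", "open_web_math",
    "algebraic_stackexchange", "pes2o", "reddit", "stackexchange",
    "starcoder", "tulu_flan", "wikipedia",
]


def source_paths(folder, base_remote=BASE_REMOTE, local_root=LOCAL_ROOT):
    """Build remote and local paths for one source folder.

    Assumption: appending the folder name to the base URL addresses that
    source. Adjust if the repository layout differs.
    """
    if folder not in SOURCE_FOLDERS:
        raise ValueError(f"unknown source folder: {folder!r}")
    return {
        "remote": f"{base_remote}/{folder}",
        "local": f"{local_root}/{folder}",
    }


def build_dataset(folders, shuffle=True):
    """Combine the chosen folders into a single StreamingDataset."""
    # Imported here so this module loads even without `streaming` installed.
    from streaming import Stream, StreamingDataset

    streams = [Stream(**source_paths(folder)) for folder in folders]
    return StreamingDataset(streams=streams, shuffle=shuffle)


print(source_paths("wikipedia"))
```

For example, `build_dataset(["wikipedia", "arxiv"])` would stream only those two sources, provided the path assumption holds.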

🔗 Related Resources

Citation

@misc{weller2025seqvsseqopen,
 title={Seq vs Seq: An Open Suite of Paired Encoders and Decoders}, 
 author={Orion Weller and Kathryn Ricci and Marc Marone and Antoine Chaffin and Dawn Lawrie and Benjamin Van Durme},
 year={2025},
 eprint={2507.11412},
 archivePrefix={arXiv},
 primaryClass={cs.CL},
 url={https://arxiv.org/abs/2507.11412}, 
}