
mmBERT Pre-training Data P1

License: MIT

Phase 1 of 3: Diverse multilingual pre-training data mixture (trained for 2.3T tokens) used to train the mmBERT model suite.

NOTE: Because of HF size limits, this repository contains only part 1 (P1) of the pre-training data. You need to download all three parts and combine them into one folder.

This dataset contains the pre-training phase data used to train all mmBERT encoder models. The data is provided in MDS format ready for use with Composer and the ModernBERT training repository.

📊 Data Composition

Data Source	Tokens (B)	Percentage	Description
FineWeb2	1,196.6	60.2%	High-quality multilingual web crawl data
DCLM	600.0	30.2%	High-quality English web crawl data
Starcoder	100.6	5.1%	Code repositories and files
Arxiv	27.8	1.4%	Academic preprints
StackExchange	18.6	0.9%	Q&A forums
Tulu Flan	15.3	0.8%	Instruction-following data
Dolmino Math	11.2	0.6%	Mathematical content
PeS2o	8.4	0.4%	Scientific papers
Wikipedia (MegaWika)	4.7	0.2%	Encyclopedia articles
Books	4.3	0.2%	Literature and reference books
StackExchange (Dolmino)	1.4	0.1%	Curated Q&A content
Total	1,989.0	100.0%	Diverse mixture for foundation training
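
The percentage column can be recomputed from the token counts. The following is a minimal Python sketch that does so using the figures in the table above; the names `TOKENS_B` and `shares` are illustrative and not part of any released tooling.

```python
# Token counts in billions, copied from the table above.
TOKENS_B = {
    "FineWeb2": 1196.6,
    "DCLM": 600.0,
    "Starcoder": 100.6,
    "Arxiv": 27.8,
    "StackExchange": 18.6,
    "Tulu Flan": 15.3,
    "Dolmino Math": 11.2,
    "PeS2o": 8.4,
    "Wikipedia (MegaWika)": 4.7,
    "Books": 4.3,
    "StackExchange (Dolmino)": 1.4,
}


def shares(tokens):
    """Return each source's share of the total, as a percentage."""
    total = sum(tokens.values())
    return {name: 100.0 * count / total for name, count in tokens.items()}


if __name__ == "__main__":
    for name, pct in shares(TOKENS_B).items():
        print(f"{name:<26}{pct:5.1f}%")
```

Summing the rows as printed gives 1,988.9B, which agrees with the stated total of 1,989.0B up to rounding of the individual rows.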

🌍 Language Coverage

This phase covers 60 languages plus code, with an inverse temperature sampling schedule starting at τ=0.7. Languages include:

High-resource: English (34.5%), Russian (5.8%), German (4.4%), Spanish (4.5%), French (4.0%), Chinese (5.2%)
Mid-resource: Italian, Portuguese, Japanese, Dutch, Polish, and 45 others
Scripts: Latin, Cyrillic, Arabic, Chinese, Japanese, Thai, and many more
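
Temperature-based language sampling is commonly written as p_i ∝ n_i^τ, where n_i is a language's share of the data: τ = 1 keeps the natural distribution, and smaller τ flattens it toward uniform. The sketch below assumes that formulation, since this card does not spell out the exact formula. It uses the six high-resource percentages listed above purely as illustrative inputs, and the function name `temperature_sample_probs` is hypothetical.

```python
def temperature_sample_probs(shares, tau):
    """Rescale shares as p_i proportional to share_i ** tau, then normalize."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    weights = {lang: share ** tau for lang, share in shares.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}


# Illustrative inputs only: the high-resource percentages listed above.
example_shares = {
    "English": 34.5, "Russian": 5.8, "Chinese": 5.2,
    "Spanish": 4.5, "German": 4.4, "French": 4.0,
}

if __name__ == "__main__":
    for tau in (1.0, 0.7):
        probs = temperature_sample_probs(example_shares, tau)
        print(tau, {lang: round(p, 3) for lang, p in probs.items()})
```

Under this assumed formula, τ = 0.7 lowers the relative weight of the largest language and raises that of the smaller ones, while preserving their order.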

🚀 Usage

For pre-training, see the ModernBERT repo: https://github.com/AnswerDotAI/ModernBERT

Direct Access

Use the script at this link to load any section of the dataset on the fly. Requesting too many samples this way will fail because of HF rate limiting. To download the full dataset, use the Hugging Face Hub's snapshot_download function.
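
A full download could look roughly like the sketch below. `snapshot_download` and its `repo_id`, `repo_type`, `local_dir`, and `allow_patterns` parameters come from the `huggingface_hub` library. The helper names `build_download_kwargs` and `download_parts`, and the repository IDs in the example call, are hypothetical placeholders. Downloading several parts into one `local_dir` is one possible way to end up with a single folder, assuming the parts do not contain clashing file paths, which this card does not confirm.

```python
from pathlib import Path


def build_download_kwargs(repo_id, local_dir, allow_patterns=None):
    """Assemble keyword arguments for huggingface_hub.snapshot_download."""
    kwargs = {
        "repo_id": repo_id,
        "repo_type": "dataset",  # required for dataset repos; the default is a model repo
        "local_dir": str(Path(local_dir)),
    }
    if allow_patterns is not None:
        kwargs["allow_patterns"] = list(allow_patterns)  # optional glob filters
    return kwargs


def download_parts(repo_ids, local_dir, allow_patterns=None):
    """Download each repository in turn into the same local folder."""
    # Imported here so the helper above can be used without the package installed.
    from huggingface_hub import snapshot_download

    return [
        snapshot_download(**build_download_kwargs(rid, local_dir, allow_patterns))
        for rid in repo_ids
    ]


# Example call (the repository IDs are placeholders, not real names):
# download_parts(["<org>/<part-1>", "<org>/<part-2>", "<org>/<part-3>"], "./mmbert-pretrain")
```

Passing `allow_patterns` restricts the download to matching files, which can help when only a subset of the data is needed.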

🔗 Related Resources

Models: mmBERT Model Suite
Phase 2: Mid-training Data (600B tokens)
Phase 3: Decay Phase Data (100B tokens)
Checkpoints: Training Checkpoints
Paper: https://arxiv.org/abs/2509.06888
Hugging Face Paper: mmBERT: A Modern Multilingual Encoder with Annealed Language Learning
Code: GitHub Repository

Citation

@misc{marone2025mmbertmodernmultilingualencoder,
 title={mmBERT: A Modern Multilingual Encoder with Annealed Language Learning}, 
 author={Marc Marone and Orion Weller and William Fleshman and Eugene Yang and Dawn Lawrie and Benjamin Van Durme},
 year={2025},
 eprint={2509.06888},
 archivePrefix={arXiv},
 primaryClass={cs.CL},
 url={https://arxiv.org/abs/2509.06888}, 
}