URL: https://huggingface.co/datasets/orionweller/mmBERT-pretrain-p1-fineweb2-langs

mmBERT Pre-training Data P1

๐Ÿ‘ License: MIT
๐Ÿ‘ Paper
๐Ÿ‘ Models
๐Ÿ‘ GitHub

Phase 1 of 3: Diverse multilingual pre-training data mixture (trained for 2.3T tokens) used to train the mmBERT model suite.

NOTE: Because of HF limits, this repository contains only P1 of the pre-training data. You need to download all three parts and combine them into one folder.

This dataset contains the pre-training phase data used to train all mmBERT encoder models. The data is provided in MDS format ready for use with Composer and the ModernBERT training repository.

📊 Data Composition

| Data Source             | Tokens (B) | Percentage | Description                              |
|-------------------------|-----------:|-----------:|------------------------------------------|
| FineWeb2                |    1,196.6 |      60.2% | High-quality multilingual web crawl data |
| DCLM                    |      600.0 |      30.2% | High-quality English web crawl data      |
| Starcoder               |      100.6 |       5.1% | Code repositories and files              |
| Arxiv                   |       27.8 |       1.4% | Academic preprints                       |
| StackExchange           |       18.6 |       0.9% | Q&A forums                               |
| Tulu Flan               |       15.3 |       0.8% | Instruction-following data               |
| Dolmino Math            |       11.2 |       0.6% | Mathematical content                     |
| PeS2o                   |        8.4 |       0.4% | Scientific papers                        |
| Wikipedia (MegaWika)    |        4.7 |       0.2% | Encyclopedia articles                    |
| Books                   |        4.3 |       0.2% | Literature and reference books           |
| StackExchange (Dolmino) |        1.4 |       0.1% | Curated Q&A content                      |
| Total                   |    1,989.0 |     100.0% | Diverse mixture for foundation training  |
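
As a quick sanity check, the percentages in the table above can be recomputed from the token counts. The sketch below copies the per-source counts from the table; the function name `mixture_shares` is made up for illustration. The rows sum to 1,988.9B, which differs from the listed total of 1,989.0B only by rounding.

```python
# Token counts in billions, copied from the data composition table.
TOKENS_B = {
    "FineWeb2": 1196.6,
    "DCLM": 600.0,
    "Starcoder": 100.6,
    "Arxiv": 27.8,
    "StackExchange": 18.6,
    "Tulu Flan": 15.3,
    "Dolmino Math": 11.2,
    "PeS2o": 8.4,
    "Wikipedia (MegaWika)": 4.7,
    "Books": 4.3,
    "StackExchange (Dolmino)": 1.4,
}


def mixture_shares(tokens):
    """Return each source's share of the mixture, in percent."""
    total = sum(tokens.values())
    return {name: 100.0 * count / total for name, count in tokens.items()}


if __name__ == "__main__":
    shares = mixture_shares(TOKENS_B)
    print(f"total: {sum(TOKENS_B.values()):.1f}B tokens")
    for name, pct in shares.items():
        print(f"{name:<24} {pct:5.1f}%")
```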

๐ŸŒ Language Coverage

This phase covers 60 languages plus code, with an inverse temperature sampling schedule starting at τ=0.7. Languages include:

  • High-resource: English (34.5%), Russian (5.8%), German (4.4%), Spanish (4.5%), French (4.0%), Chinese (5.2%)
  • Mid-resource: Italian, Portuguese, Japanese, Dutch, Polish, and 45 others
  • Scripts: Latin, Cyrillic, Arabic, Chinese, Japanese, Thai, and many more
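
To illustrate what temperature-based sampling does to a language mixture, here is a minimal sketch. It assumes the common formulation in which a language with n_i tokens is sampled with probability proportional to n_i^τ; the exact formula used for mmBERT is not given in this card, and the corpus sizes below are toy values, not the real per-language statistics.

```python
def temperature_sampling_probs(token_counts, tau):
    """Return sampling probabilities proportional to count ** tau.

    tau = 1.0 reproduces the raw proportions; smaller tau flattens the
    distribution, giving lower-resource languages a larger share.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    weights = {lang: count ** tau for lang, count in token_counts.items()}
    norm = sum(weights.values())
    return {lang: weight / norm for lang, weight in weights.items()}


# Hypothetical corpus sizes (arbitrary units), NOT the real mmBERT statistics.
toy_counts = {"high": 1000.0, "mid": 100.0, "low": 10.0}

raw = temperature_sampling_probs(toy_counts, tau=1.0)   # raw proportions
flat = temperature_sampling_probs(toy_counts, tau=0.7)  # starting value for this phase

if __name__ == "__main__":
    for lang in toy_counts:
        print(f"{lang:<5} raw={raw[lang]:.3f}  tau=0.7 -> {flat[lang]:.3f}")
```

With τ below 1, the largest language loses share and the smallest gains it, so lower-resource languages are seen more often than their raw size would suggest.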

🚀 Usage

For pre-training, see the ModernBERT repo: https://github.com/AnswerDotAI/ModernBERT

Direct Access

Use the script at this link to load any section of the dataset on the fly. This will fail if you request too many samples, because of HF rate limiting. To download the full dataset, use `snapshot_download` from the Hugging Face Hub library.
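
For the full download, a minimal sketch using `snapshot_download` from the `huggingface_hub` package might look like the following. It assumes `huggingface_hub` is installed; the helper name `download_phase` and the target directory are hypothetical, and only the P1 repository ID from this page is used.

```python
REPO_ID = "orionweller/mmBERT-pretrain-p1-fineweb2-langs"


def download_phase(local_dir, repo_id=REPO_ID):
    """Download one dataset repository from the Hub and return its local path."""
    # Imported lazily so the module can be loaded without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",  # this is a dataset repo, not a model repo
        local_dir=local_dir,
    )


if __name__ == "__main__":
    # Hypothetical target directory; this starts a very large download.
    path = download_phase("./mmbert-pretrain-p1")
    print("downloaded to", path)
```

The other two parts would be fetched the same way with their own repository IDs, which are not listed here.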

🔗 Related Resources

Citation

@misc{marone2025mmbertmodernmultilingualencoder,
 title={mmBERT: A Modern Multilingual Encoder with Annealed Language Learning}, 
 author={Marc Marone and Orion Weller and William Fleshman and Eugene Yang and Dawn Lawrie and Benjamin Van Durme},
 year={2025},
 eprint={2509.06888},
 archivePrefix={arXiv},
 primaryClass={cs.CL},
 url={https://arxiv.org/abs/2509.06888}, 
}