mmBERT: A Modern Multilingual Encoder
License: MIT
TL;DR: A state-of-the-art multilingual encoder trained on 3T+ tokens across 1800+ languages, introducing novel techniques for learning low-resource languages during the decay phase.
mmBERT is a modern multilingual encoder that significantly outperforms previous-generation models such as XLM-R on classification, embedding, and retrieval tasks. Built on the ModernBERT architecture with new multilingual training techniques, mmBERT demonstrates that low-resource languages can be learned effectively during the decay phase of training. It is also significantly faster than any previous multilingual encoder.
Table of Contents
- Highlights
- Quick Start
- Model Description
- Novel Training Innovations
- Model Family
- Training Data
- Usage Examples
- Fine-tuning Examples
- Model Architecture
- Citation
Quick Start
Installation
pip install "torch>=1.9.0"
pip install "transformers>=4.48.0"
Usage
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmBERT-base")
model = AutoModel.from_pretrained("jhu-clsp/mmBERT-base")
inputs = tokenizer("Hello world", return_tensors="pt")
outputs = model(**inputs)
Model Description
mmBERT represents the first significant advancement over XLM-R for massively multilingual encoder models. Key features include:
- Massive Language Coverage - Trained on over 1800 languages with a progressive inclusion strategy
- Modern Architecture - Built on the ModernBERT foundation with Flash Attention 2 and unpadding techniques
- Novel Training Recipe - Introduces inverse mask scheduling and inverse temperature sampling
- Open Training Data - Complete 3T+ token dataset publicly available
- Decay Phase Innovation - Demonstrates effective learning of low-resource languages in the final training phase
The model uses bidirectional attention with masked language modeling objectives, optimized specifically for multilingual understanding and cross-lingual transfer.
Novel Training Innovations
Progressive Language Addition: Start with 60 high-resource languages, expand to 110 languages by adding mid-resource ones, then include all 1833 languages in the decay phase.
Inverse Mask Schedule: Reduce mask ratio from 30% → 15% → 5% across training phases for progressively refined learning.
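As a rough illustration of the mask schedule, the sketch below picks mask positions for one sequence at each ratio. It assumes the three ratios map onto the pre-training, mid-training, and decay phases in that order; `MASK_RATIO_BY_PHASE` and `choose_mask_positions` are hypothetical names, and a real MLM pipeline would also handle special tokens and token replacement rules.

```python
import random

# Assumed mapping of the three mask ratios onto the three training phases.
MASK_RATIO_BY_PHASE = {
    "pre-training": 0.30,
    "mid-training": 0.15,
    "decay": 0.05,
}


def choose_mask_positions(num_tokens, phase, rng):
    """Pick which token positions to mask for one sequence in a given phase."""
    ratio = MASK_RATIO_BY_PHASE[phase]
    # Always mask at least one token so every sequence contributes a loss term.
    num_to_mask = max(1, round(num_tokens * ratio))
    return sorted(rng.sample(range(num_tokens), num_to_mask))


rng = random.Random(0)
for phase in MASK_RATIO_BY_PHASE:
    positions = choose_mask_positions(512, phase, rng)
    print(f"{phase}: masking {len(positions)} of 512 tokens")
```

Fewer masked tokens per sequence in later phases means each prediction sees more surrounding context.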
Inverse Temperature Sampling: Lower the sampling temperature from τ=0.7 (biased toward high-resource languages) to τ=0.3 (closer to uniform across languages).
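To make the effect of τ concrete, here is a minimal sketch that assumes the common exponent-smoothing formulation, where a language with n tokens is sampled with probability proportional to n ** τ. The exact formula used in training is not stated here, and the token counts below are hypothetical values chosen only for illustration.

```python
def sampling_probs(token_counts, tau):
    """Return per-language probabilities proportional to count ** tau."""
    weights = {lang: count ** tau for lang, count in token_counts.items()}
    total = sum(weights.values())
    return {lang: weight / total for lang, weight in weights.items()}


# Hypothetical token counts for a high-, mid-, and low-resource language.
counts = {"en": 1_000_000, "sw": 10_000, "fo": 100}

early = sampling_probs(counts, tau=0.7)  # start of training
late = sampling_probs(counts, tau=0.3)   # decay phase

for lang in counts:
    print(f"{lang}: tau=0.7 -> {early[lang]:.4f}, tau=0.3 -> {late[lang]:.4f}")
```

Under this assumption, lowering τ shifts probability mass from the largest language toward the smaller ones without making the distribution fully uniform.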
Model Merging: Combine English-focused, high-resource, and all-language decay variants using TIES merging.
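The general shape of TIES merging (trim, elect sign, merge agreeing entries) can be sketched on flat parameter lists as follows. This is a simplified illustration, not the merging code used for mmBERT: the function name, the `density` and `scale` values, and the choice of a shared starting checkpoint as `base` are all assumptions.

```python
def ties_merge(base, variants, density=0.2, scale=1.0):
    """Merge several fine-tuned parameter vectors in the style of TIES.

    base:     list of floats, the shared starting checkpoint (assumed)
    variants: list of lists of floats, one per model variant
    density:  fraction of largest-magnitude changes kept per variant
    scale:    multiplier applied to the merged change
    """
    num_params = len(base)
    keep = max(1, int(num_params * density))

    # Step 1: compute each variant's change from base and trim it to the
    # `keep` entries with the largest magnitude.
    trimmed = []
    for params in variants:
        delta = [p - b for p, b in zip(params, base)]
        order = sorted(range(num_params), key=lambda i: abs(delta[i]), reverse=True)
        top = set(order[:keep])
        trimmed.append([d if i in top else 0.0 for i, d in enumerate(delta)])

    merged = []
    for i in range(num_params):
        values = [t[i] for t in trimmed]
        # Step 2: elect the sign carrying the larger total magnitude.
        total = sum(values)
        sign = 1.0 if total > 0 else -1.0 if total < 0 else 0.0
        # Step 3: average only the entries that agree with the elected sign.
        agreeing = [v for v in values if v * sign > 0]
        merged_delta = sum(agreeing) / len(agreeing) if agreeing else 0.0
        merged.append(base[i] + scale * merged_delta)
    return merged


# Toy example with three variants of a four-parameter model.
base = [0.0, 0.0, 0.0, 0.0]
variants = [
    [1.0, 0.1, -2.0, 0.0],
    [3.0, -0.2, 1.0, 0.0],
    [-0.5, 0.0, -4.0, 0.05],
]
merged = ties_merge(base, variants, density=0.5)
print(merged)
```

In the toy example, changes that disagree with the elected sign for a parameter are left out of the average, so conflicting updates do not cancel each other.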
Model Family
| Model | Total Params | Non-embed Params | Languages | Download |
|---|---|---|---|---|
| mmBERT-small | 140M | 42M | 1800+ | 👁 Download |
| mmBERT-base | 307M | 110M | 1800+ | 👁 Download |
Training Data
mmBERT training data is publicly available across different phases:
| Phase | Dataset | Tokens | Description |
|---|---|---|---|
| Pre-training P1 | mmbert-pretrain-p1 | 2.3T | 60 languages, foundational training |
| Pre-training P2 | mmbert-pretrain-p2 | - | Extension data for pre-training phase |
| Pre-training P3 | mmbert-pretrain-p3 | - | Final pre-training data |
| Mid-training | mmbert-midtraining | 600B | 110 languages, context extension to 8K |
| Decay Phase | mmbert-decay | 100B | 1833 languages, premium quality |
Data Sources: Filtered DCLM (English), FineWeb2 (multilingual), FineWeb2-HQ (20 high-resource languages), Wikipedia (MegaWika), code repositories (StarCoder, ProLong), academic papers (ArXiv, PeS2o), and community discussions (StackExchange).
Model Architecture
| Parameter | mmBERT-small | mmBERT-base |
|---|---|---|
| Layers | 22 | 22 |
| Hidden Size | 384 | 768 |
| Intermediate Size | 1152 | 1152 |
| Attention Heads | 6 | 12 |
| Total Parameters | 140M | 307M |
| Non-embedding Parameters | 42M | 110M |
| Max Sequence Length | 8192 | 8192 |
| Vocabulary Size | 256,000 | 256,000 |
| Tokenizer | Gemma 2 | Gemma 2 |
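The gap between total and non-embedding parameters in the table can be sanity-checked with a little arithmetic. The sketch below assumes the difference is a single vocabulary-by-hidden-size token-embedding matrix, which is an assumption about how the counts were derived, not a statement from the table.

```python
VOCAB_SIZE = 256_000


def embedding_params(hidden_size, vocab_size=VOCAB_SIZE):
    """Parameters in a vocab_size x hidden_size token-embedding matrix."""
    return vocab_size * hidden_size


# (name, hidden size, total params in M, non-embedding params in M) from the table
rows = [
    ("mmBERT-small", 384, 140, 42),
    ("mmBERT-base", 768, 307, 110),
]

for name, hidden, total_m, non_embed_m in rows:
    estimate_m = embedding_params(hidden) / 1e6
    table_gap_m = total_m - non_embed_m
    print(f"{name}: estimated {estimate_m:.1f}M embedding params, table gap {table_gap_m}M")
```

Under that assumption, the estimates land within about 1M of the table gap for both models, which shows how much of each model's size comes from the 256,000-entry vocabulary.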
Usage Examples
Masked Language Modeling
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmBERT-base")
model = AutoModelForMaskedLM.from_pretrained("jhu-clsp/mmBERT-base")

def predict_masked_token(text):
    """Return the top-5 predicted tokens for the first mask in `text`."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Locate mask positions as (batch_index, token_index) pairs
    mask_indices = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)
    predictions = outputs.logits[mask_indices]
    top_tokens = torch.topk(predictions, 5, dim=-1)
    return [tokenizer.decode(token) for token in top_tokens.indices[0]]

# Works across languages
texts = [
    "The capital of France is <mask>.",
    "La capital de España es <mask>.",
    "Die Hauptstadt von Deutschland ist <mask>."
]

for text in texts:
    predictions = predict_masked_token(text)
    print(f"Text: {text}")
    print(f"Predictions: {predictions}")
Cross-lingual Embeddings
from transformers import AutoTokenizer, AutoModel
import torch
from sklearn.metrics.pairwise import cosine_similarity

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmBERT-base")
model = AutoModel.from_pretrained("jhu-clsp/mmBERT-base")

def get_embeddings(texts):
    """Mean-pool the last hidden state over real (non-padding) tokens."""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    hidden = outputs.last_hidden_state
    # Zero out padding positions so they do not skew the average
    mask = inputs["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return embeddings.numpy()

multilingual_texts = [
    "Artificial intelligence is transforming technology",
    "La inteligencia artificial está transformando la tecnología",
    "L'intelligence artificielle transforme la technologie",
    "人工智能正在改变技术"
]

embeddings = get_embeddings(multilingual_texts)
similarities = cosine_similarity(embeddings)
print("Cross-lingual similarity matrix:")
print(similarities)
Fine-tuning Examples
Dense Retrieval with Sentence Transformers
Cross-lingual Classification
Multilingual Reranking
Training Data
mmBERT was trained on a carefully curated 3T+ token multilingual dataset:
| Phase | Tokens | Description |
|---|---|---|
| Pre-training P1 | 2.3T tokens | 60 languages, diverse data mixture |
| Pre-training P2 | - | Extension data for pre-training |
| Pre-training P3 | - | Final pre-training data |
| Mid-training | 600B tokens | 110 languages, context extension |
| Decay Phase | 100B tokens | 1833 languages, premium quality |
Primary Sources:
- Filtered DCLM: High-quality English content
- FineWeb2: Broad multilingual web coverage (1800+ languages)
- FineWeb2-HQ: Filtered subset of 20 high-resource languages
- Code: StarCoder and ProLong repositories
- Academic: ArXiv papers and PeS2o scientific content
- Reference: Wikipedia (MegaWika) and textbooks
- Community: StackExchange discussions
Citation
If you use mmBERT in your research, please cite our work:
@misc{marone2025mmbertmodernmultilingualencoder,
title={mmBERT: A Modern Multilingual Encoder with Annealed Language Learning},
author={Marc Marone and Orion Weller and William Fleshman and Eugene Yang and Dawn Lawrie and Benjamin Van Durme},
year={2025},
eprint={2509.06888},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.06888},
}
