SentenceTransformer

ModernBERT-small-v2 is a compact dense vector encoder designed to be both efficient and accurate. It combines a small ModernBERT architecture, simple MLM training, and distillation from a larger, stronger model to reach strong retrieval performance at a lower computational cost than standard large models.

Key Features & Training Methodology

This model was created using a specialized four-stage pipeline:

Deep & Narrow Architecture: Unlike typical small models (e.g., 6 layers), this student model has 12 Transformer layers but operates within a narrow 384-dimensional embedding space. The depth is intended to support the more complex reasoning that high-accuracy retrieval requires, while the narrow dimension keeps encoding fast and index sizes small.
Guided Initialization (GUIDE): The model did not start from random weights. It inherited structural and semantic knowledge from a larger teacher model (answerdotai/ModernBERT-base) via Principal Component Analysis (PCA) projection. This technique compressed the teacher's 768-dimensional representations into the student's 384-dimensional space, giving the student a substantial head start.
Extensive MLM Pre-training: Following initialization, the model underwent comprehensive Masked Language Modeling (MLM) pre-training on a highly diverse corpus combining:
- Search Data (MS MARCO)
- Academic Texts (Stanford Philosophy)
- General Knowledge (NPR, FineWiki)
Knowledge Distillation (STS Tuning): The final, critical stage optimized the model for semantic similarity. It was trained to mimic the output embeddings of a powerful Retrieval Teacher (Alibaba-NLP/gte-modernbert-base) using Mean Squared Error (MSE) loss. This specialized tuning ensures its 384-dimensional vectors excel at similarity and retrieval tasks.
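
The PCA projection named in the Guided Initialization step can be pictured with a small NumPy sketch. This is not the GUIDE implementation or the author's training code: the `pca_project` helper, the table size, and the random stand-in for teacher weights are all assumptions made for illustration. It only shows how a 768-dimensional table can be projected onto its top 384 principal directions.

```python
import numpy as np

def pca_project(teacher_matrix: np.ndarray, student_dim: int) -> np.ndarray:
    """Project rows of teacher_matrix onto their top `student_dim` principal components."""
    # Center the columns so the SVD yields principal directions.
    mean = teacher_matrix.mean(axis=0, keepdims=True)
    centered = teacher_matrix - mean
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:student_dim]      # (student_dim, teacher_dim)
    return centered @ components.T     # (n_rows, student_dim)

rng = np.random.default_rng(0)
# Hypothetical stand-in for a teacher embedding table: 1000 rows x 768 dims.
teacher_embeddings = rng.normal(size=(1000, 768)).astype(np.float32)
student_embeddings = pca_project(teacher_embeddings, student_dim=384)
print(student_embeddings.shape)  # (1000, 384)
```

Because the components are ordered by explained variance, truncating to the first 384 keeps the directions along which the teacher table varies most.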

Training

The final model, ModernBERT-small-v2, was trained on a curated combination of four datasets to ensure broad general knowledge. As noted in the list below, FineWiki was used only for the final distillation tuning, not for MLM pre-training.

GitHub: semantic-search-models/ModernBERT-small-v2

The following datasets were integrated and processed:

MS MARCO Triplets (sentence-transformers/msmarco-msmarco-MiniLM-L6-v3, "triplet" split)
- Source Focus: Query/Document ranking (Search Relevance).
Stanford Encyclopedia of Philosophy Triplets (johnnyboycurtis/Philosophical-Triplets-Retrieval)
- Source Focus: Deep, technical, and abstract academic reasoning.
NPR Articles (sentence-transformers/npr)
- Source Focus: Modern news, journalistic style, and general current events.
FineWiki (English) (HuggingFaceFW/finewiki, "en" split)
- Source Focus: Encyclopedic, factual knowledge spanning a wide range of topics.
- Only used in distillation training; not used in MLM.

(Note: During the final Knowledge Distillation phase, the targets were generated using embeddings from the teacher model (Alibaba-NLP/gte-modernbert-base) based on the combined text content of this merged corpus.)

Model Details

Model Description

Model Type: Sentence Transformer
Maximum Sequence Length: 1024 tokens
Output Dimensionality: 384 dimensions
Similarity Function: Cosine Similarity
Training Dataset:
- parquet
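
To make the effect of the 384-dimensional output on index size concrete, here is a rough back-of-the-envelope calculation. The corpus size and the `index_size_bytes` helper are hypothetical, and the figure covers only raw float32 vectors in a flat index, without metadata, graph structures, or compression.

```python
def index_size_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw storage for a flat, uncompressed index of float32 vectors."""
    return num_vectors * dim * bytes_per_value

num_docs = 1_000_000  # hypothetical corpus size
small = index_size_bytes(num_docs, 384)
base = index_size_bytes(num_docs, 768)
print(f"384-dim: {small / 1e9:.3f} GB")  # 384-dim: 1.536 GB
print(f"768-dim: {base / 1e9:.3f} GB")   # 768-dim: 3.072 GB
```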

Model Sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture

SentenceTransformer(
 (0): Transformer({'max_seq_length': 1024, 'do_lower_case': False, 'architecture': 'ModernBertModel'})
 (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
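
The Pooling module above has `pooling_mode_mean_tokens` set to True, so sentence vectors are the average of token vectors. The following NumPy sketch illustrates that computation on a toy batch. It is an illustration of masked mean pooling in general, not the library's own code, and the `mean_pool` helper and the 4-dimensional toy values are assumptions.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sequence, ignoring padded positions."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # avoid division by zero
    return summed / counts

# Toy batch: 2 sequences, 3 token positions, 4 dims (the real model uses 384 dims).
tokens = np.array([
    [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0], [9.0, 9.0, 9.0, 9.0]],
    [[2.0, 2.0, 2.0, 2.0], [4.0, 4.0, 4.0, 4.0], [6.0, 6.0, 6.0, 6.0]],
])
mask = np.array([[1, 1, 0],   # last position is padding and is ignored
                 [1, 1, 1]])
pooled = mean_pool(tokens, mask)
print(pooled)
# [[2. 3. 4. 5.]
#  [4. 4. 4. 4.]]
```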

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

import torch
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub.
# "flash_attention_2" requires the flash-attn package and a supported GPU; otherwise use "sdpa".
model = SentenceTransformer(
    "johnnyboycurtis/ModernBERT-small-v2",
    model_kwargs={"attn_implementation": "flash_attention_2", "dtype": torch.bfloat16},
)

# Run inference
sentences = [
 '# Breda Holmes\nBreda Holmes is a former camogie player, winner of the B+I Star of the Year award in 1987 and seven All Ireland medals in succession between 1984 and 1991, celebrating the seventh by scoring the match-turning goal from Ann Downey’s sideline ball against Cork in the 1991 final.\n\n## Career\nShe captained Carysfort Training College in their 1984 Purcell Cup campaign and won six All Ireland club medals with St Paul’s camogie club, based in Kilkenny city.\n',
 'What is Intellectual Property? Intellectual property (IP) refers to creations of the mind, such as inventions; literary and artistic works; designs; and symbols, names and images used in commerce. IP is protected in law by, for example, patents, copyright and trademarks, which enable people to earn recognition or financial benefit from what they invent or create.',
 '10 Most Famous Soccer Stadiums in the World. The Camp Nou with its capacity of 99,354 is the largest stadium in Europe and also the fourth largest soccer stadium in the world. It is situated in Barcelona, Catalonia, Spain, and is the home of Spanish club Barcelona since 1957.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 384)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.2616, 0.5490],
# [0.2616, 1.0000, 0.3196],
# [0.5490, 0.3196, 1.0000]])
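
For retrieval, the same cosine scores can be used to rank documents against a query. The sketch below shows one way to do that with plain NumPy. The `top_k_cosine` helper and the 3-dimensional toy vectors are hypothetical stand-ins for arrays returned by `model.encode(...)`.

```python
import numpy as np

def top_k_cosine(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 2):
    """Return (index, score) pairs for the k documents most similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity per document
    order = np.argsort(-scores)[:k]     # highest scores first
    return [(int(i), float(scores[i])) for i in order]

# Toy 3-dim vectors standing in for 384-dim sentence embeddings.
docs = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [1.0, 1.0, 0.0]])
query = np.array([1.0, 0.1, 0.0])
hits = top_k_cosine(query, docs, k=2)
print([i for i, _ in hits])  # [0, 2]
```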

Evaluation

Metrics

Knowledge Distillation

Dataset: mse-dev
Evaluated with MSEEvaluator

Metric	Value
negative_mse	-77.74
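
The scale of this number is easier to read with a small example. The sketch assumes that Sentence Transformers' MSEEvaluator reports the per-element mean squared error multiplied by 100 and negated. Under that assumption, which is worth verifying against the installed library version, -77.74 would correspond to a raw MSE of about 0.7774. The `negative_mse` helper and the toy vectors are illustrative only.

```python
def negative_mse(student_vectors, teacher_vectors, scale=100.0):
    """Negative mean squared error over all elements, multiplied by `scale`."""
    total, count = 0.0, 0
    for s_vec, t_vec in zip(student_vectors, teacher_vectors):
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    return -scale * total / count

# Toy 2-dim vectors: squared errors are 0, 4, 0, 4, so the raw MSE is 2.0.
student = [[1.0, 2.0], [0.0, 0.0]]
teacher = [[1.0, 0.0], [0.0, 2.0]]
score = negative_mse(student, teacher)
print(score)  # -200.0
```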

Information Retrieval

Datasets: NanoMSMARCO and NanoHotpotQA
Evaluated with InformationRetrievalEvaluator

Metric	NanoMSMARCO	NanoHotpotQA
cosine_accuracy@1	0.32	0.52
cosine_accuracy@3	0.52	0.76
cosine_accuracy@5	0.6	0.78
cosine_accuracy@10	0.76	0.84
cosine_precision@1	0.32	0.52
cosine_precision@3	0.1733	0.3333
cosine_precision@5	0.12	0.22
cosine_precision@10	0.076	0.122
cosine_recall@1	0.32	0.26
cosine_recall@3	0.52	0.5
cosine_recall@5	0.6	0.55
cosine_recall@10	0.76	0.61
cosine_ndcg@10	0.5251	0.5457
cosine_mrr@10	0.4523	0.6494
cosine_map@100	0.4624	0.4736
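
The relationships between these metrics can be checked on a single toy query. In the table, NanoMSMARCO precision@k equals accuracy@k divided by k (for example 0.52 / 3 ≈ 0.1733), which is consistent with one relevant passage per query, while NanoHotpotQA recall is lower than accuracy, which is consistent with several relevant documents per query. The sketch below uses a hypothetical ranking and a hypothetical `ir_metrics_at_k` helper. It is not the InformationRetrievalEvaluator code.

```python
def ir_metrics_at_k(ranked_ids, relevant_ids, k):
    """Accuracy, precision, recall and reciprocal rank at cutoff k for one query."""
    top_k = ranked_ids[:k]
    hits = [doc_id for doc_id in top_k if doc_id in relevant_ids]
    first_hit_rank = next(
        (rank for rank, doc_id in enumerate(top_k, start=1) if doc_id in relevant_ids),
        None,
    )
    return {
        "accuracy": 1.0 if hits else 0.0,               # at least one relevant hit
        "precision": len(hits) / k,                     # share of top-k that is relevant
        "recall": len(hits) / len(relevant_ids),        # share of relevant docs found
        "mrr": 1.0 / first_hit_rank if first_hit_rank else 0.0,
    }

# Hypothetical query with two relevant documents, as in a multi-hop dataset.
ranked = ["d7", "d2", "d9", "d4", "d1"]
relevant = {"d2", "d4"}
m = ir_metrics_at_k(ranked, relevant, k=3)
print(m)
```

Averaging these per-query values over all queries gives dataset-level numbers of the kind reported above.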

Nano BEIR

Dataset: NanoBEIR_mean

Evaluated with NanoBEIREvaluator with these parameters:

{
 "dataset_names": [
 "MSMARCO",
 "HotpotQA"
 ],
 "dataset_id": "sentence-transformers/NanoBEIR-en"
}

Metric	Value
cosine_accuracy@1	0.42
cosine_accuracy@3	0.64
cosine_accuracy@5	0.69
cosine_accuracy@10	0.8
cosine_precision@1	0.42
cosine_precision@3	0.2533
cosine_precision@5	0.17
cosine_precision@10	0.099
cosine_recall@1	0.29
cosine_recall@3	0.51
cosine_recall@5	0.575
cosine_recall@10	0.685
cosine_ndcg@10	0.5354
cosine_mrr@10	0.5509
cosine_map@100	0.468
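
The NanoBEIR_mean values appear to be the plain arithmetic mean of the two per-dataset columns reported earlier. The short check below recomputes a few of them from the rounded table values. The tolerance is an assumption that accounts for the inputs being rounded to four decimals.

```python
# (NanoMSMARCO, NanoHotpotQA) values copied from the Information Retrieval table.
per_dataset = {
    "cosine_accuracy@1": (0.32, 0.52),
    "cosine_recall@10": (0.76, 0.61),
    "cosine_ndcg@10": (0.5251, 0.5457),
    "cosine_mrr@10": (0.4523, 0.6494),
    "cosine_map@100": (0.4624, 0.4736),
}
# Values copied from the NanoBEIR_mean table.
reported_mean = {
    "cosine_accuracy@1": 0.42,
    "cosine_recall@10": 0.685,
    "cosine_ndcg@10": 0.5354,
    "cosine_mrr@10": 0.5509,
    "cosine_map@100": 0.468,
}

max_gap = 0.0
for name, (msmarco, hotpotqa) in per_dataset.items():
    mean = (msmarco + hotpotqa) / 2
    max_gap = max(max_gap, abs(mean - reported_mean[name]))
    print(f"{name}: recomputed {mean:.5f}, reported {reported_mean[name]}")
```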

Training Details

Training Dataset

parquet

Dataset: parquet
Size: 3,375,201 training samples
Columns: text and label
Approximate statistics based on the first 1000 samples:
	text	label
type	string	list
details	min: 5 tokens mean: 280.41 tokens max: 1024 tokens	size: 384 elements

Samples:

text	label
`# Scientists Link Diamonds To Earth's Quick Cooling Scientists say they have evidence the Earth was bombarded by meteors about 13,000 years ago, triggering a 1,000-year cold spell. Researchers write in the journal Science that they have found a layer of microscopic diamonds scattered across North America. An abrupt cooling may have caused many large mammals to become extinct.`	`[4.6171875, 2.515625, 2.439453125, -1.4853515625, -6.328125, ...]`
# Brad Giffen Brad Giffen is a retired Canadian news anchor who has worked on television in both Canada and the United States. Over his broadcasting career he has also worked as a radio personality, disc jockey, VJ, television reporter, television producer and voice-over artist. ## Broadcasting career Giffen studied at the Poynter Institute for Advanced Journalism Study. In the late 1980s he was a broadcaster on CHUM-FM radio station in Toronto, Ontario, Canada. He previously was John Majhor's successor veejay on CITY-TV's music video program Toronto Rocks. and he hosted the CBC Television battle of the bands competition Rock Wars. In 1990, Giffen pivoted to news journalism and became a reporter for CFTO's nightly news program World Beat News (later rebranded as CFTO News in early 1998, and CTV News in 2005). In 1993, Giffen moved to the United States and became co-anchor of the nightly news on the Fox affiliate KSTU, in Salt Lake City, Utah. Giffen left that post in 1995 to accept ...	`[-1.693359375, 13.3828125, 4.50390625, 0.41064453125, -2.884765625, ...]`
# How Trump Won, According To The Exit Polls Donald Trump will be the next president of the United States. That's remarkable for all sorts of reasons: He has no governmental experience, for example. And many times during his campaign, Trump's words inflamed large swaths of Americans, whether it was his comments from years ago talking about grabbing women's genitals or calling Mexican immigrants in the U.S. illegally "rapists" and playing up crimes committed by immigrants, including drug crimes and murders. But right now, it's also remarkable because almost no one saw it coming. All major forecasters predicted a Hillary Clinton win, whether moderately or by a landslide. So what happened? We don't know just yet why pollsters and forecasters got it wrong, but here's what made this electorate so different from the one that elected Barack Obama by 4 points in 2012. To be clear, it's impossible to break any election results out into fully discrete demographic groups or trends — race, gend...	`[3.4296875, 12.828125, 2.8203125, -5.47265625, -5.390625, ...]`

Loss: MSELoss

Training Hyperparameters

Non-Default Hyperparameters

eval_strategy: steps
per_device_train_batch_size: 64
per_device_eval_batch_size: 64
learning_rate: 0.0001
num_train_epochs: 2
warmup_steps: 0.1
fp16: True
load_best_model_at_end: True
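
These settings imply a rough training length, which the arithmetic below estimates. It assumes a single device, no gradient accumulation, and that the fractional `warmup_steps` value is read as a share of total steps. None of these is stated in the card, so treat the result as an order-of-magnitude estimate.

```python
import math

train_samples = 3_375_201   # training dataset size reported in this card
batch_size = 64             # per_device_train_batch_size
epochs = 2                  # num_train_epochs
warmup = 0.1                # warmup_steps, read here as a fraction (assumption)

# Assumes one device and no gradient accumulation.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup)
print(steps_per_epoch, total_steps, warmup_steps)  # 52738 105476 10548
```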

Framework Versions

Python: 3.11.13
Sentence Transformers: 5.2.2
Transformers: 5.1.0
PyTorch: 2.7.1+cu128
Accelerate: 1.9.0
Datasets: 4.0.0
Tokenizers: 0.22.2

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
 title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
 author = "Reimers, Nils and Gurevych, Iryna",
 booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
 month = "11",
 year = "2019",
 publisher = "Association for Computational Linguistics",
 url = "https://arxiv.org/abs/1908.10084",
}

MSELoss

@inproceedings{reimers-2020-multilingual-sentence-bert,
 title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation",
 author = "Reimers, Nils and Gurevych, Iryna",
 booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
 month = "11",
 year = "2020",
 publisher = "Association for Computational Linguistics",
 url = "https://arxiv.org/abs/2004.09813",
}

ModernBERT Model Architecture

@misc{warner2024smarterbetterfasterlonger,
 title={Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference}, 
 author={Benjamin Warner and Antoine Chaffin and Benjamin Clavié and Orion Weller and Oskar Hallström and Said Taghadouini and Alexis Gallagher and Raja Biswas and Faisal Ladhak and Tom Aarsen and Nathan Cooper and Griffin Adams and Jeremy Howard and Iacopo Poli},
 year={2024},
 eprint={2412.13663},
 archivePrefix={arXiv},
 primaryClass={cs.CL},
 url={https://arxiv.org/abs/2412.13663}, 
}

Model Weight Initialization

@misc{trinh2025guideguidedinitializationdistillation,
 title={GUIDE: Guided Initialization and Distillation of Embeddings}, 
 author={Khoa Trinh and Gaurav Menghani and Erik Vee},
 year={2025},
 eprint={2510.06502},
 archivePrefix={arXiv},
 primaryClass={cs.LG},
 url={https://arxiv.org/abs/2510.06502}, 
}

Model size: 37M params (Safetensors, tensor type F32)

URL: https://huggingface.co/johnnyboycurtis/ModernBERT-small-v2
