Static Embeddings with BERT Multilingual uncased tokenizer finetuned on various datasets
This is a sentence-transformers model trained on the wikititles, tatoeba, talks, europarl, global_voices, muse, wikimatrix, opensubtitles, stackexchange, quora, wikianswers_duplicates, all_nli, simple_wiki, altlex, flickr30k_captions, coco_captions, nli_for_simcse and negation datasets. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, paraphrase mining, text classification, clustering, and more.
Read our Static Embeddings blogpost to learn more about this model and how it was trained.
- 0 Active Parameters: This model has no active parameters. It computes an embedding solely by averaging pre-computed token embeddings.
- 100x to 400x faster: On CPU, this model is 100x to 400x faster than common options like multilingual-e5-small. On GPU, it's 10x to 25x faster.
- Matryoshka: This model was trained with a Matryoshka loss, allowing you to truncate the embeddings for faster retrieval at minimal performance costs.
- Evaluations: See Evaluations for details on performance on NanoBEIR, embedding speed, and Matryoshka dimensionality truncation.
- Training Script: See train.py for the training script used to train this model from scratch.
See static-retrieval-mrl-en-v1 for an English static embedding model that has been finetuned specifically for retrieval tasks.
Model Details
Model Description
- Model Type: Sentence Transformer
- Maximum Sequence Length: inf tokens
- Output Dimensionality: 1024 dimensions
- Similarity Function: Cosine Similarity
- Training Datasets:
- Languages: en, multilingual, ar, bg, ca, cs, da, de, el, es, et, fa, fi, fr, gl, gu, he, hi, hu, hy, id, it, ja, ka, ko, ku, lt, lv, mk, mn, mr, ms, my, nb, nl, pl, pt, ro, ru, sk, sl, sq, sr, sv, th, tr, uk, ur, vi, zh, hr
- License: apache-2.0
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
(0): StaticEmbedding(
(embedding): EmbeddingBag(105879, 1024, mode='mean')
)
)
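The architecture above is a single embedding table whose rows are averaged (`mode='mean'`). The following is a minimal sketch of that idea using NumPy, not the library's implementation. The tiny table size, the random values, and the token ids are all assumptions chosen for illustration; the real table is 105879 x 1024.

```python
import numpy as np

# Toy stand-in for the real 105879 x 1024 embedding table (assumed sizes).
rng = np.random.default_rng(0)
vocab_size, dim = 10, 4
embedding_table = rng.normal(size=(vocab_size, dim))


def embed(token_ids):
    """Return the mean of the embedding rows selected by the token ids."""
    return embedding_table[token_ids].mean(axis=0)


# Hypothetical tokenizer output for one sentence.
token_ids = [2, 5, 7]
sentence_embedding = embed(token_ids)
print(sentence_embedding.shape)  # (4,)
```

Because the result is a plain average, it does not depend on token order, and the work per sentence grows linearly with the number of tokens.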
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("tomaarsen/static-similarity-mrl-multilingual-v1")
# Run inference
sentences = [
'It is known for its dry red chili powder .',
'It is popular for dry red chili powder .',
'These monsters will move in large groups .',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# => (3, 1024)
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# => torch.Size([3, 3])
This model was trained with Matryoshka loss, allowing this model to be used with lower dimensionalities with minimal performance loss.
Notably, a lower dimensionality allows for much faster downstream tasks, such as clustering or classification. You can specify a lower dimensionality with the truncate_dim argument when initializing the Sentence Transformer model:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("tomaarsen/static-similarity-mrl-multilingual-v1", truncate_dim=256)
embeddings = model.encode([
"I used to hate him.",
"Раньше я ненавидел его."
])
print(embeddings.shape)
# => (2, 256)
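Conceptually, Matryoshka truncation keeps only the first k dimensions of each embedding. The sketch below shows that step followed by a cosine similarity computation in NumPy. It is an illustration under stated assumptions rather than the library's internals: the random array stands in for the output of `model.encode(...)`, and the helper names are hypothetical.

```python
import numpy as np


def truncate_and_normalize(embeddings, dim):
    """Keep the first `dim` columns, then scale each row to unit length."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / norms


def cosine_similarity_matrix(embeddings, dim):
    """Pairwise cosine similarity of the truncated embeddings."""
    unit = truncate_and_normalize(embeddings, dim)
    return unit @ unit.T


# Random placeholder for real 1024-dimensional embeddings (assumption).
rng = np.random.default_rng(0)
full = rng.normal(size=(3, 1024))

sims = cosine_similarity_matrix(full, 256)
print(sims.shape)  # (3, 3)
```

Re-normalizing after truncation matters because a prefix of a unit vector is generally no longer unit length.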
Evaluation
We've evaluated the model on 5 languages that are well covered by MTEB benchmarks across various tasks.
We want to reiterate that this model is not intended for retrieval use cases. Instead, we evaluate on Semantic Textual Similarity (STS), Classification, and Pair Classification. We compare against the excellent and small multilingual-e5-small model.
Across all measured languages, static-similarity-mrl-multilingual-v1 reaches on average 92.3% of the performance of multilingual-e5-small on STS, 95.52% on Pair Classification, and 86.52% on Classification.
To make up for this performance reduction, static-similarity-mrl-multilingual-v1 is approximately 125x faster on CPU and 10x faster on GPU devices than multilingual-e5-small. Because the cost of attention models grows super-linearly with input length, while the cost of static embedding models grows linearly, the speedup only becomes larger as the number of tokens to encode increases.
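To illustrate why the gap widens with input length, here is a toy cost model. It is a deliberate simplification: it assumes one unit of work per token for a static model and one unit per token pair for attention, and it ignores all constant factors, so the ratios are not the measured speedups reported above.

```python
def static_cost(num_tokens):
    """Toy cost: one embedding lookup and accumulation per token."""
    return num_tokens


def attention_cost(num_tokens):
    """Toy cost: one interaction per pair of tokens."""
    return num_tokens ** 2


# The ratio between the two toy costs grows with sequence length.
ratios = {n: attention_cost(n) / static_cost(n) for n in (16, 128, 512)}
for n, ratio in ratios.items():
    print(n, ratio)
```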
Matryoshka Evaluation
Lastly, we measured how English STS performance on MTEB changes when the output embeddings are truncated to a lower dimensionality, Matryoshka-style.
Figure: English STS MTEB performance vs Matryoshka dimensionality reduction
As you can see, you can easily reduce the dimensionality by 2x or 4x with minor (0.15% or 0.56%) performance hits. If the speed of your downstream task or your storage costs are a bottleneck, this should allow you to alleviate some of those concerns.
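As a rough sense of the storage side of this trade-off, the arithmetic below estimates index size at different dimensionalities. It assumes embeddings are stored as 32-bit floats and uses a hypothetical corpus of one million embeddings; neither figure comes from the evaluation.

```python
BYTES_PER_FLOAT32 = 4


def storage_bytes(num_embeddings, dim):
    """Raw size of a dense float32 embedding matrix, without index overhead."""
    return num_embeddings * dim * BYTES_PER_FLOAT32


# Hypothetical corpus of one million embeddings.
for dim in (1024, 512, 256):
    gib = storage_bytes(1_000_000, dim) / 2**30
    print(dim, round(gib, 2))
# 1024 3.81
# 512 1.91
# 256 0.95
```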
Training Details
Training Datasets
Evaluation Datasets
Training Hyperparameters
Non-Default Hyperparameters
- eval_strategy: steps
- per_device_train_batch_size: 2048
- per_device_eval_batch_size: 2048
- learning_rate: 0.2
- num_train_epochs: 1
- warmup_ratio: 0.1
- bf16: True
- batch_sampler: no_duplicates
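The non-default values listed here could be expressed with Sentence Transformers' training arguments roughly as follows. This is a sketch of how the settings map onto the API, not the contents of train.py. The `output_dir` path is a hypothetical placeholder, and `bf16=True` assumes hardware with bfloat16 support.

```python
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="output/static-similarity",  # hypothetical path, not from the source
    eval_strategy="steps",
    per_device_train_batch_size=2048,
    per_device_eval_batch_size=2048,
    learning_rate=0.2,
    num_train_epochs=1,
    warmup_ratio=0.1,
    bf16=True,  # assumes bfloat16-capable hardware
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)
```

All other arguments are left at their defaults.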
All Hyperparameters
Training Logs
| Epoch | Step | Training Loss | wikititles loss | tatoeba loss | talks loss | europarl loss | global voices loss | muse loss | wikimatrix loss | opensubtitles loss | stackexchange loss | quora loss | wikianswers duplicates loss | all nli loss | simple wiki loss | altlex loss | flickr30k captions loss | coco captions loss | nli for simcse loss | negation loss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.0000 | 1 | 38.504 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| 0.0327 | 1000 | 21.3661 | 15.2607 | 9.1892 | 11.6736 | 1.6431 | 6.6894 | 31.9579 | 3.0122 | 0.3541 | 5.1814 | 2.3756 | 4.9474 | 12.7699 | 0.5687 | 0.8911 | 21.0068 | 17.1302 | 10.8964 | 6.7603 |
| 0.0654 | 2000 | 9.8377 | 11.7637 | 7.1680 | 8.7697 | 1.6077 | 5.2310 | 27.4887 | 1.8375 | 0.3379 | 5.1107 | 2.2083 | 4.1690 | 12.0384 | 0.4837 | 0.7131 | 20.5401 | 17.8388 | 10.6706 | 7.0488 |
| 0.0982 | 3000 | 8.5279 | 10.8719 | 6.6160 | 8.3116 | 1.5638 | 4.7298 | 25.8572 | 1.6738 | 0.3152 | 5.1009 | 2.0893 | 3.7332 | 12.0452 | 0.4285 | 0.6519 | 20.2154 | 16.2715 | 10.7693 | 7.3144 |
| 0.1309 | 4000 | 7.8208 | 10.4614 | 5.4918 | 7.4421 | 1.4420 | 4.0505 | 24.9000 | 1.3462 | 0.2925 | 4.7643 | 2.1143 | 3.7457 | 11.6570 | 0.4390 | 0.6536 | 19.4405 | 16.0912 | 10.7537 | 7.2120 |
| 0.1636 | 5000 | 7.5347 | 9.5381 | 5.9489 | 7.4027 | 1.4858 | 4.0272 | 23.8335 | 1.2453 | 0.3027 | 3.1262 | 1.9170 | 3.7535 | 11.6186 | 0.4090 | 0.6131 | 18.9329 | 16.1769 | 10.1123 | 7.0750 |
| 0.1963 | 6000 | 7.1819 | 9.2175 | 5.3231 | 7.0836 | 1.4795 | 3.8328 | 23.1620 | 1.1609 | 0.2964 | 2.7653 | 1.9440 | 3.6610 | 11.2147 | 0.3714 | 0.5853 | 19.0478 | 16.4413 | 9.5790 | 6.8695 |
| 0.2291 | 7000 | 6.9852 | 9.0344 | 5.5773 | 6.7928 | 1.4409 | 3.9232 | 23.2098 | 1.1750 | 0.2877 | 2.9254 | 1.9411 | 3.5469 | 11.0744 | 0.4254 | 0.6293 | 19.0447 | 16.3774 | 9.5363 | 6.8393 |
| 0.2618 | 8000 | 6.8114 | 8.9620 | 5.1417 | 6.5466 | 1.4834 | 3.7100 | 22.9815 | 1.0679 | 0.2942 | 2.7687 | 2.0211 | 3.6063 | 11.3424 | 0.4447 | 0.6223 | 19.1836 | 16.5669 | 9.8785 | 6.8528 |
| 0.2945 | 9000 | 6.5487 | 8.6320 | 4.8710 | 6.5144 | 1.4156 | 3.5712 | 22.9660 | 1.0261 | 0.3051 | 3.0898 | 1.9981 | 3.4305 | 11.1448 | 0.3729 | 0.5814 | 18.8865 | 15.8581 | 9.5213 | 6.7567 |
| 0.3272 | 10000 | 6.7398 | 8.5630 | 4.7179 | 6.5025 | 1.3931 | 3.5699 | 22.5319 | 0.9916 | 0.2870 | 3.3385 | 1.9580 | 3.5807 | 11.2592 | 0.4155 | 0.6009 | 19.1387 | 16.6836 | 9.6300 | 6.6613 |
| 0.3599 | 11000 | 6.3915 | 8.4041 | 4.8985 | 6.2787 | 1.4081 | 3.5082 | 22.3204 | 0.9554 | 0.2916 | 2.9365 | 2.0176 | 3.3900 | 11.2956 | 0.3902 | 0.5783 | 18.6448 | 16.1241 | 9.5388 | 6.7295 |
| 0.3927 | 12000 | 6.5902 | 8.1888 | 4.7326 | 6.1930 | 1.4550 | 3.4999 | 22.1070 | 0.9736 | 0.2935 | 2.9612 | 1.9449 | 3.3281 | 11.0477 | 0.3821 | 0.5696 | 18.3227 | 16.1848 | 9.4772 | 7.0029 |
| 0.4254 | 13000 | 6.341 | 8.1827 | 4.3838 | 6.1052 | 1.4165 | 3.3944 | 21.9552 | 0.9076 | 0.2991 | 3.2272 | 1.9822 | 3.3494 | 11.1891 | 0.3790 | 0.5600 | 18.4394 | 15.9000 | 9.5644 | 6.9056 |
| 0.4581 | 14000 | 6.2067 | 8.1549 | 4.4833 | 6.0765 | 1.4055 | 3.3903 | 21.4785 | 0.8962 | 0.2919 | 2.8893 | 1.9540 | 3.3078 | 11.2100 | 0.3569 | 0.5461 | 18.7667 | 16.2978 | 9.2310 | 7.1290 |
| 0.4908 | 15000 | 6.2237 | 8.0711 | 4.4755 | 6.0087 | 1.3185 | 3.2888 | 21.3689 | 0.8433 | 0.2861 | 3.0129 | 1.9084 | 3.3279 | 11.1236 | 0.3730 | 0.5553 | 18.2711 | 15.7648 | 9.5295 | 7.0092 |
| 0.5236 | 16000 | 6.1058 | 8.0282 | 4.5076 | 5.8760 | 1.4234 | 3.3046 | 21.3568 | 0.8298 | 0.2826 | 2.8404 | 1.8920 | 3.2918 | 11.1140 | 0.3811 | 0.5550 | 18.2899 | 15.8630 | 9.4807 | 6.7585 |
| 0.5563 | 17000 | 6.3038 | 7.8679 | 4.4780 | 5.8461 | 1.4016 | 3.2279 | 21.0624 | 0.8205 | 0.2804 | 3.1359 | 1.9066 | 3.3205 | 11.0882 | 0.3913 | 0.5569 | 18.0693 | 15.7346 | 9.2854 | 6.9239 |
| 0.5890 | 18000 | 5.9824 | 7.7827 | 4.3199 | 5.7441 | 1.3582 | 3.1982 | 21.2444 | 0.8046 | 0.2797 | 2.7466 | 1.8717 | 3.3112 | 11.0553 | 0.3922 | 0.5568 | 18.0357 | 15.6732 | 9.6404 | 6.8331 |
| 0.6217 | 19000 | 6.0275 | 7.7201 | 4.3591 | 5.8132 | 1.3466 | 3.1888 | 20.9311 | 0.8019 | 0.2765 | 2.7674 | 1.8670 | 3.3082 | 10.9725 | 0.3996 | 0.5560 | 18.6346 | 16.2965 | 9.3774 | 6.9957 |
| 0.6545 | 20000 | 6.1161 | 7.6429 | 4.2702 | 5.7298 | 1.3670 | 3.1433 | 20.8899 | 0.7871 | 0.2761 | 2.7486 | 1.9230 | 3.2958 | 11.0207 | 0.3516 | 0.5361 | 18.2297 | 15.6363 | 9.6376 | 7.1608 |
| 0.6872 | 21000 | 5.9608 | 7.5852 | 4.2419 | 5.7760 | 1.3838 | 3.1878 | 20.9966 | 0.7837 | 0.2761 | 2.7098 | 1.8715 | 3.2293 | 10.8935 | 0.3514 | 0.5307 | 18.1424 | 15.5101 | 9.5346 | 7.0668 |
| 0.7199 | 22000 | 5.7594 | 7.5562 | 4.1123 | 5.6151 | 1.3605 | 3.0954 | 21.0032 | 0.7640 | 0.2769 | 2.6019 | 1.8378 | 3.2377 | 11.0744 | 0.3676 | 0.5431 | 18.2222 | 15.7103 | 9.8826 | 7.2662 |
| 0.7526 | 23000 | 5.7118 | 7.4714 | 4.0531 | 5.5998 | 1.3546 | 3.0778 | 20.8820 | 0.7518 | 0.2800 | 2.7544 | 1.8756 | 3.2316 | 10.9986 | 0.3571 | 0.5334 | 18.4476 | 15.7161 | 9.6617 | 7.3730 |
| 0.7853 | 24000 | 5.8024 | 7.4414 | 4.0829 | 5.6335 | 1.3383 | 3.0710 | 20.8217 | 0.7487 | 0.2713 | 2.6091 | 1.8695 | 3.2365 | 10.9929 | 0.3419 | 0.5213 | 18.4064 | 15.7831 | 9.7747 | 7.4290 |
| 0.8181 | 25000 | 5.8608 | 7.4348 | 4.0571 | 5.5651 | 1.3294 | 3.0518 | 20.6831 | 0.7393 | 0.2784 | 2.6330 | 1.8293 | 3.2197 | 10.9416 | 0.3484 | 0.5213 | 18.6359 | 15.8463 | 9.6883 | 7.4697 |
| 0.8508 | 26000 | 5.742 | 7.4188 | 3.9483 | 5.4911 | 1.3288 | 3.0402 | 20.7187 | 0.7376 | 0.2772 | 2.6812 | 1.8540 | 3.2415 | 10.9619 | 0.3560 | 0.5323 | 18.6388 | 15.7688 | 9.6707 | 7.3793 |
| 0.8835 | 27000 | 5.7429 | 7.3956 | 3.9016 | 5.4393 | 1.3277 | 3.0129 | 20.6748 | 0.7314 | 0.2820 | 2.6526 | 1.8798 | 3.1869 | 10.8744 | 0.3435 | 0.5228 | 18.5191 | 15.7264 | 9.5707 | 7.4266 |
| 0.9162 | 28000 | 5.7825 | 7.3748 | 3.9100 | 5.4261 | 1.3420 | 3.0142 | 20.6013 | 0.7263 | 0.2764 | 2.6708 | 1.8529 | 3.1748 | 10.8951 | 0.3491 | 0.5257 | 18.4914 | 15.5663 | 9.6552 | 7.2807 |
| 0.9490 | 29000 | 5.5179 | 7.3555 | 3.9046 | 5.3902 | 1.3283 | 2.9882 | 20.5828 | 0.7169 | 0.2732 | 2.6742 | 1.8457 | 3.1760 | 10.9126 | 0.3494 | 0.5246 | 18.5619 | 15.6746 | 9.6539 | 7.3694 |
| 0.9817 | 30000 | 5.4044 | 7.3390 | 3.8742 | 5.3713 | 1.3127 | 2.9796 | 20.5703 | 0.7120 | 0.2669 | 2.5612 | 1.8536 | 3.1602 | 10.9068 | 0.3464 | 0.5229 | 18.5389 | 15.6788 | 9.5690 | 7.4148 |
| 1.0000 | 30560 | - | 7.3346 | 3.8728 | 5.3680 | 1.3066 | 2.9780 | 20.5635 | 0.7107 | 0.2672 | 2.5046 | 1.8514 | 3.1596 | 10.9153 | 0.3467 | 0.5233 | 18.5525 | 15.6815 | 9.5687 | 7.4302 |
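The Epoch column in the log is the step number divided by the total number of steps. The short calculation below reproduces that column and estimates how many steps the learning-rate warmup covers. It assumes that step 30560 is the final step and that `warmup_ratio: 0.1` applies to that total; the exact rounding used by the trainer may differ.

```python
TOTAL_STEPS = 30560  # final step in the training log
WARMUP_RATIO = 0.1   # from the non-default hyperparameters


def epoch_fraction(step):
    """Fraction of the single training epoch completed at `step`."""
    return round(step / TOTAL_STEPS, 4)


# Assumed: warmup spans the first 10% of all steps.
warmup_steps = round(TOTAL_STEPS * WARMUP_RATIO)

print(epoch_fraction(1000))   # 0.0327
print(epoch_fraction(30000))  # 0.9817
print(warmup_steps)           # 3056
```

Under these assumptions, warmup ends shortly after the row logged at step 3000.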
Environmental Impact
Carbon emissions were measured using CodeCarbon.
- Energy Consumed: 0.506 kWh
- Carbon Emitted: 0.197 kg of CO2
- Hours Used: 3.163 hours
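Two derived quantities follow from the reported figures by simple division: the average power draw over the run and the implied carbon intensity of the electricity. The calculation below is only arithmetic on the three numbers above. How CodeCarbon attributes energy to individual components is not covered here.

```python
energy_kwh = 0.506  # Energy Consumed
carbon_kg = 0.197   # Carbon Emitted
hours = 3.163       # Hours Used

# Average power in watts over the measured period.
average_power_w = 1000 * energy_kwh / hours

# Implied carbon intensity in kg of CO2 per kWh.
carbon_intensity = carbon_kg / energy_kwh

print(round(average_power_w))      # 160
print(round(carbon_intensity, 3))  # 0.389
```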
Training Hardware
- On Cloud: No
- GPU Model: 1 x NVIDIA GeForce RTX 3090
- CPU Model: 13th Gen Intel(R) Core(TM) i7-13700K
- RAM Size: 31.78 GB
Framework Versions
- Python: 3.11.6
- Sentence Transformers: 3.3.0.dev0
- Transformers: 4.45.2
- PyTorch: 2.5.0+cu121
- Accelerate: 1.0.0
- Datasets: 2.20.0
- Tokenizers: 0.20.1-dev.0
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
MatryoshkaLoss
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
MultipleNegativesRankingLoss
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
