Static Embeddings with BERT Multilingual uncased tokenizer finetuned on various datasets

This is a sentence-transformers model trained on the wikititles, tatoeba, talks, europarl, global_voices, muse, wikimatrix, opensubtitles, stackexchange, quora, wikianswers_duplicates, all_nli, simple_wiki, altlex, flickr30k_captions, coco_captions, nli_for_simcse and negation datasets. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, paraphrase mining, text classification, clustering, and more.

Read our Static Embeddings blogpost to learn more about this model and how it was trained.

- 0 Active Parameters: This model has no active parameters; it only averages pre-computed token embeddings.
- 100x to 400x faster: On CPU, this model is 100x to 400x faster than common options like multilingual-e5-small. On GPU, it's 10x to 25x faster.
- Matryoshka: This model was trained with a Matryoshka loss, allowing you to truncate the embeddings for faster retrieval at minimal performance cost.
- Evaluations: See Evaluation for details on performance on NanoBEIR, embedding speed, and Matryoshka dimensionality truncation.
- Training Script: See train.py for the training script used to train this model from scratch.

See static-retrieval-mrl-en-v1 for an English static embedding model that has been finetuned specifically for retrieval tasks.

Model Details

Model Description

Model Type: Sentence Transformer
Maximum Sequence Length: inf tokens
Output Dimensionality: 1024 dimensions
Similarity Function: Cosine Similarity
Training Datasets:
- wikititles
- tatoeba
- talks
- europarl
- global_voices
- muse
- wikimatrix
- opensubtitles
- stackexchange
- quora
- wikianswers_duplicates
- all_nli
- simple_wiki
- altlex
- flickr30k_captions
- coco_captions
- nli_for_simcse
- negation
Languages: en, multilingual, ar, bg, ca, cs, da, de, el, es, et, fa, fi, fr, gl, gu, he, hi, hu, hy, id, it, ja, ka, ko, ku, lt, lv, mk, mn, mr, ms, my, nb, nl, pl, pt, ro, ru, sk, sl, sq, sr, sv, th, tr, uk, ur, vi, zh, hr
License: apache-2.0

Model Sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture

SentenceTransformer(
  (0): StaticEmbedding(
    (embedding): EmbeddingBag(105879, 1024, mode='mean')
  )
)
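
As a rough illustration of what the EmbeddingBag layer with mode='mean' does, the sketch below averages pre-computed token vectors in plain Python. The tiny vocabulary, the 4-dimensional vectors, and the `embed` helper are hypothetical stand-ins, not the model's real tokenizer or weights.

```python
# Toy static-embedding lookup: a sentence vector is the mean of its token vectors.
# The vocabulary and 4-dimensional table are made up for illustration; the real
# model uses a 105879 x 1024 table and a BERT multilingual uncased tokenizer.
TOKEN_TABLE = {
    "dry": [0.2, 0.0, 0.4, 0.1],
    "red": [0.6, 0.2, 0.0, 0.3],
    "chili": [0.1, 0.4, 0.2, 0.2],
}


def embed(tokens):
    """Average the pre-computed vectors of the given tokens (mean pooling)."""
    vectors = [TOKEN_TABLE[token] for token in tokens]
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


sentence_vector = embed(["dry", "red", "chili"])
print(sentence_vector)
```

Since tokens never interact beyond this average, the work per sentence grows linearly with the number of tokens.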

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("tomaarsen/static-similarity-mrl-multilingual-v1")
# Run inference
sentences = [
    'It is known for its dry red chili powder .',
    'It is popular for dry red chili powder .',
    'These monsters will move in large groups .',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1024)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
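
This model card lists cosine similarity as the similarity function, so the scores above should be pairwise cosine values. A minimal plain-Python sketch of that computation follows; the helper names and the 3-dimensional toy vectors are assumptions for illustration, not part of the library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(vectors):
    """Pairwise cosine similarities, one row per input vector."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Hypothetical 3-dimensional vectors standing in for real embeddings.
toy_embeddings = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 2.0]]
matrix = similarity_matrix(toy_embeddings)
for row in matrix:
    print([round(score, 4) for score in row])
```

The matrix is symmetric with ones on the diagonal, and vector length does not affect the score.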

This model was trained with Matryoshka loss, so it can be used at lower dimensionalities with minimal performance loss. Notably, a lower dimensionality makes downstream tasks such as clustering or classification much faster. You can specify a lower dimensionality with the truncate_dim argument when initializing the Sentence Transformer model:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("tomaarsen/static-similarity-mrl-multilingual-v1", truncate_dim=256)
embeddings = model.encode([
    "I used to hate him.",
    "Раньше я ненавидел его."
])
print(embeddings.shape)
# => (2, 256)
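
Conceptually, Matryoshka truncation keeps only the leading dimensions of each embedding. The sketch below shows that step on a made-up 8-dimensional vector, with an optional re-normalization that matters only if you compare vectors by dot product rather than cosine similarity. The `truncate_and_normalize` helper is hypothetical and not a library function.

```python
import math


def truncate_and_normalize(vector, dim):
    """Keep the first `dim` values and rescale the result to unit length."""
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]


# Hypothetical 8-dimensional embedding standing in for a 1024-dimensional one.
full = [3.0, 4.0, 0.5, -0.5, 0.1, 0.2, -0.1, 0.0]
short = truncate_and_normalize(full, 2)
print(short)
```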

Evaluation

We've evaluated the model on 5 languages that have many benchmarks across various tasks on MTEB.

We want to reiterate that this model is not intended for retrieval use cases. Instead, we evaluate on Semantic Textual Similarity (STS), Classification, and Pair Classification. We compare against the excellent and small multilingual-e5-small model.


Across all measured languages, static-similarity-mrl-multilingual-v1 reaches on average 92.3% of multilingual-e5-small's performance on STS, 95.52% on Pair Classification, and 86.52% on Classification.


To make up for this performance reduction, static-similarity-mrl-multilingual-v1 is approximately 125x faster on CPU and 10x faster on GPU than multilingual-e5-small. Because attention models scale super-linearly with the number of tokens while static embedding models scale linearly, the speedup grows as the number of tokens to encode increases.
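
To see why the gap widens with longer inputs, here is a toy cost model. It is purely illustrative: the unit costs are arbitrary placeholders rather than measurements, and the functions are hypothetical. It only shows that a quadratic term divided by a linear term keeps growing with the token count.

```python
def attention_cost(num_tokens, linear_unit=1.0, quadratic_unit=1.0):
    """Toy cost with a linear term plus a quadratic (token-pair) term."""
    return linear_unit * num_tokens + quadratic_unit * num_tokens**2


def static_cost(num_tokens, lookup_unit=1.0):
    """Toy cost of one table lookup per token."""
    return lookup_unit * num_tokens


# The ratio grows with the number of tokens under these assumptions.
for n in (16, 128, 1024):
    print(n, attention_cost(n) / static_cost(n))
```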

Matryoshka Evaluation

Lastly, we measured how English STS performance on MTEB changes under Matryoshka-style dimensionality reduction, i.e. when the output embeddings are truncated to a lower dimensionality.

Figure: English STS MTEB performance vs Matryoshka dimensionality reduction

As you can see, you can easily reduce the dimensionality by 2x or 4x with minor performance hits of 0.15% or 0.56%, respectively. If the speed of your downstream task or your storage costs are a bottleneck, this should allow you to alleviate some of those concerns.
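
The storage side of this trade-off is simple arithmetic. The sketch below assumes embeddings are stored as float32 (4 bytes per value) and uses an arbitrary corpus of one million vectors; both are assumptions for illustration.

```python
BYTES_PER_FLOAT32 = 4  # assumption: embeddings are stored as float32


def index_size_bytes(num_vectors, dim, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw storage for `num_vectors` embeddings of `dim` values each."""
    return num_vectors * dim * bytes_per_value


# One million vectors is an arbitrary example corpus size.
for dim in (1024, 512, 256):
    size_gb = index_size_bytes(1_000_000, dim) / 1e9
    print(f"{dim} dims: {size_gb:.3f} GB")
```

Halving the dimensionality halves the raw index size, so the 4x truncation stores a quarter of the bytes.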

Training Details

Training Hyperparameters

Non-Default Hyperparameters

eval_strategy: steps
per_device_train_batch_size: 2048
per_device_eval_batch_size: 2048
learning_rate: 0.2
num_train_epochs: 1
warmup_ratio: 0.1
bf16: True
batch_sampler: no_duplicates
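
To make the warmup setting concrete, the sketch below computes the learning rate at a given step. It assumes a linear warmup followed by a linear decay to zero, which is not stated in this card, and takes the total of 30560 steps from the final row of the training log. The `learning_rate_at` helper is hypothetical.

```python
TOTAL_STEPS = 30560    # final step in the training log
PEAK_LR = 0.2          # learning_rate
WARMUP_RATIO = 0.1     # warmup_ratio
WARMUP_STEPS = round(TOTAL_STEPS * WARMUP_RATIO)


def learning_rate_at(step):
    """Linear warmup to PEAK_LR, then linear decay to zero (assumed schedule)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


for step in (0, 1528, 3056, 30560):
    print(step, learning_rate_at(step))
```

Under these assumptions the learning rate peaks at step 3056, roughly a tenth of the way through the single epoch.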

Training Logs

Epoch	Step	Training Loss	wikititles loss	tatoeba loss	talks loss	europarl loss	global voices loss	muse loss	wikimatrix loss	opensubtitles loss	stackexchange loss	quora loss	wikianswers duplicates loss	all nli loss	simple wiki loss	altlex loss	flickr30k captions loss	coco captions loss	nli for simcse loss	negation loss
0.0000	1	38.504	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
0.0327	1000	21.3661	15.2607	9.1892	11.6736	1.6431	6.6894	31.9579	3.0122	0.3541	5.1814	2.3756	4.9474	12.7699	0.5687	0.8911	21.0068	17.1302	10.8964	6.7603
0.0654	2000	9.8377	11.7637	7.1680	8.7697	1.6077	5.2310	27.4887	1.8375	0.3379	5.1107	2.2083	4.1690	12.0384	0.4837	0.7131	20.5401	17.8388	10.6706	7.0488
0.0982	3000	8.5279	10.8719	6.6160	8.3116	1.5638	4.7298	25.8572	1.6738	0.3152	5.1009	2.0893	3.7332	12.0452	0.4285	0.6519	20.2154	16.2715	10.7693	7.3144
0.1309	4000	7.8208	10.4614	5.4918	7.4421	1.4420	4.0505	24.9000	1.3462	0.2925	4.7643	2.1143	3.7457	11.6570	0.4390	0.6536	19.4405	16.0912	10.7537	7.2120
0.1636	5000	7.5347	9.5381	5.9489	7.4027	1.4858	4.0272	23.8335	1.2453	0.3027	3.1262	1.9170	3.7535	11.6186	0.4090	0.6131	18.9329	16.1769	10.1123	7.0750
0.1963	6000	7.1819	9.2175	5.3231	7.0836	1.4795	3.8328	23.1620	1.1609	0.2964	2.7653	1.9440	3.6610	11.2147	0.3714	0.5853	19.0478	16.4413	9.5790	6.8695
0.2291	7000	6.9852	9.0344	5.5773	6.7928	1.4409	3.9232	23.2098	1.1750	0.2877	2.9254	1.9411	3.5469	11.0744	0.4254	0.6293	19.0447	16.3774	9.5363	6.8393
0.2618	8000	6.8114	8.9620	5.1417	6.5466	1.4834	3.7100	22.9815	1.0679	0.2942	2.7687	2.0211	3.6063	11.3424	0.4447	0.6223	19.1836	16.5669	9.8785	6.8528
0.2945	9000	6.5487	8.6320	4.8710	6.5144	1.4156	3.5712	22.9660	1.0261	0.3051	3.0898	1.9981	3.4305	11.1448	0.3729	0.5814	18.8865	15.8581	9.5213	6.7567
0.3272	10000	6.7398	8.5630	4.7179	6.5025	1.3931	3.5699	22.5319	0.9916	0.2870	3.3385	1.9580	3.5807	11.2592	0.4155	0.6009	19.1387	16.6836	9.6300	6.6613
0.3599	11000	6.3915	8.4041	4.8985	6.2787	1.4081	3.5082	22.3204	0.9554	0.2916	2.9365	2.0176	3.3900	11.2956	0.3902	0.5783	18.6448	16.1241	9.5388	6.7295
0.3927	12000	6.5902	8.1888	4.7326	6.1930	1.4550	3.4999	22.1070	0.9736	0.2935	2.9612	1.9449	3.3281	11.0477	0.3821	0.5696	18.3227	16.1848	9.4772	7.0029
0.4254	13000	6.341	8.1827	4.3838	6.1052	1.4165	3.3944	21.9552	0.9076	0.2991	3.2272	1.9822	3.3494	11.1891	0.3790	0.5600	18.4394	15.9000	9.5644	6.9056
0.4581	14000	6.2067	8.1549	4.4833	6.0765	1.4055	3.3903	21.4785	0.8962	0.2919	2.8893	1.9540	3.3078	11.2100	0.3569	0.5461	18.7667	16.2978	9.2310	7.1290
0.4908	15000	6.2237	8.0711	4.4755	6.0087	1.3185	3.2888	21.3689	0.8433	0.2861	3.0129	1.9084	3.3279	11.1236	0.3730	0.5553	18.2711	15.7648	9.5295	7.0092
0.5236	16000	6.1058	8.0282	4.5076	5.8760	1.4234	3.3046	21.3568	0.8298	0.2826	2.8404	1.8920	3.2918	11.1140	0.3811	0.5550	18.2899	15.8630	9.4807	6.7585
0.5563	17000	6.3038	7.8679	4.4780	5.8461	1.4016	3.2279	21.0624	0.8205	0.2804	3.1359	1.9066	3.3205	11.0882	0.3913	0.5569	18.0693	15.7346	9.2854	6.9239
0.5890	18000	5.9824	7.7827	4.3199	5.7441	1.3582	3.1982	21.2444	0.8046	0.2797	2.7466	1.8717	3.3112	11.0553	0.3922	0.5568	18.0357	15.6732	9.6404	6.8331
0.6217	19000	6.0275	7.7201	4.3591	5.8132	1.3466	3.1888	20.9311	0.8019	0.2765	2.7674	1.8670	3.3082	10.9725	0.3996	0.5560	18.6346	16.2965	9.3774	6.9957
0.6545	20000	6.1161	7.6429	4.2702	5.7298	1.3670	3.1433	20.8899	0.7871	0.2761	2.7486	1.9230	3.2958	11.0207	0.3516	0.5361	18.2297	15.6363	9.6376	7.1608
0.6872	21000	5.9608	7.5852	4.2419	5.7760	1.3838	3.1878	20.9966	0.7837	0.2761	2.7098	1.8715	3.2293	10.8935	0.3514	0.5307	18.1424	15.5101	9.5346	7.0668
0.7199	22000	5.7594	7.5562	4.1123	5.6151	1.3605	3.0954	21.0032	0.7640	0.2769	2.6019	1.8378	3.2377	11.0744	0.3676	0.5431	18.2222	15.7103	9.8826	7.2662
0.7526	23000	5.7118	7.4714	4.0531	5.5998	1.3546	3.0778	20.8820	0.7518	0.2800	2.7544	1.8756	3.2316	10.9986	0.3571	0.5334	18.4476	15.7161	9.6617	7.3730
0.7853	24000	5.8024	7.4414	4.0829	5.6335	1.3383	3.0710	20.8217	0.7487	0.2713	2.6091	1.8695	3.2365	10.9929	0.3419	0.5213	18.4064	15.7831	9.7747	7.4290
0.8181	25000	5.8608	7.4348	4.0571	5.5651	1.3294	3.0518	20.6831	0.7393	0.2784	2.6330	1.8293	3.2197	10.9416	0.3484	0.5213	18.6359	15.8463	9.6883	7.4697
0.8508	26000	5.742	7.4188	3.9483	5.4911	1.3288	3.0402	20.7187	0.7376	0.2772	2.6812	1.8540	3.2415	10.9619	0.3560	0.5323	18.6388	15.7688	9.6707	7.3793
0.8835	27000	5.7429	7.3956	3.9016	5.4393	1.3277	3.0129	20.6748	0.7314	0.2820	2.6526	1.8798	3.1869	10.8744	0.3435	0.5228	18.5191	15.7264	9.5707	7.4266
0.9162	28000	5.7825	7.3748	3.9100	5.4261	1.3420	3.0142	20.6013	0.7263	0.2764	2.6708	1.8529	3.1748	10.8951	0.3491	0.5257	18.4914	15.5663	9.6552	7.2807
0.9490	29000	5.5179	7.3555	3.9046	5.3902	1.3283	2.9882	20.5828	0.7169	0.2732	2.6742	1.8457	3.1760	10.9126	0.3494	0.5246	18.5619	15.6746	9.6539	7.3694
0.9817	30000	5.4044	7.3390	3.8742	5.3713	1.3127	2.9796	20.5703	0.7120	0.2669	2.5612	1.8536	3.1602	10.9068	0.3464	0.5229	18.5389	15.6788	9.5690	7.4148
1.0000	30560	-	7.3346	3.8728	5.3680	1.3066	2.9780	20.5635	0.7107	0.2672	2.5046	1.8514	3.1596	10.9153	0.3467	0.5233	18.5525	15.6815	9.5687	7.4302
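
The Epoch column follows directly from the step count, since the run covers one epoch in 30560 steps. The sketch below reproduces that relationship and gives an upper bound on the number of training samples seen, assuming every batch held the full 2048 samples; that assumption may not hold for the last batch of each dataset.

```python
TOTAL_STEPS = 30560      # last step in the log, where Epoch reaches 1.0
TRAIN_BATCH_SIZE = 2048  # per_device_train_batch_size on a single GPU


def epoch_fraction(step):
    """Fraction of the single training epoch completed at `step`."""
    return step / TOTAL_STEPS


print(f"{epoch_fraction(1000):.4f}")
# Upper bound on training samples seen, assuming every batch was full.
print(TOTAL_STEPS * TRAIN_BATCH_SIZE)
```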

Environmental Impact

Carbon emissions were measured using CodeCarbon.

Energy Consumed: 0.506 kWh
Carbon Emitted: 0.197 kg of CO2
Hours Used: 3.163 hours
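
Two derived figures follow from these numbers: the average power draw during training and the carbon intensity implied by the measurement. The sketch below is plain arithmetic on the three reported values; the variable names are illustrative.

```python
ENERGY_KWH = 0.506   # Energy Consumed
CARBON_KG = 0.197    # Carbon Emitted
HOURS = 3.163        # Hours Used

average_power_watts = ENERGY_KWH / HOURS * 1000
carbon_intensity = CARBON_KG / ENERGY_KWH  # kg of CO2 per kWh

print(f"Average power draw: {average_power_watts:.0f} W")
print(f"Implied carbon intensity: {carbon_intensity:.3f} kg CO2/kWh")
```

This works out to roughly 160 W on average and about 0.39 kg of CO2 per kWh.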

Training Hardware

On Cloud: No
GPU Model: 1 x NVIDIA GeForce RTX 3090
CPU Model: 13th Gen Intel(R) Core(TM) i7-13700K
RAM Size: 31.78 GB

Framework Versions

Python: 3.11.6
Sentence Transformers: 3.3.0.dev0
Transformers: 4.45.2
PyTorch: 2.5.0+cu121
Accelerate: 1.0.0
Datasets: 2.20.0
Tokenizers: 0.20.1-dev.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
 title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
 author = "Reimers, Nils and Gurevych, Iryna",
 booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
 month = "11",
 year = "2019",
 publisher = "Association for Computational Linguistics",
 url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
 title={Matryoshka Representation Learning},
 author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
 year={2024},
 eprint={2205.13147},
 archivePrefix={arXiv},
 primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
 title={Efficient Natural Language Response Suggestion for Smart Reply},
 author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
 year={2017},
 eprint={1705.00652},
 archivePrefix={arXiv},
 primaryClass={cs.CL}
}

URL: https://huggingface.co/T3LS/static-similarity-mrl-multilingual-v1-32d-fp16
