URL: https://huggingface.co/T3LS/static-similarity-mrl-multilingual-v1-32d-fp16

T3LS/static-similarity-mrl-multilingual-v1-32d-fp16


Static Embeddings with BERT Multilingual uncased tokenizer finetuned on various datasets

This is a sentence-transformers model trained on the wikititles, tatoeba, talks, europarl, global_voices, muse, wikimatrix, opensubtitles, stackexchange, quora, wikianswers_duplicates, all_nli, simple_wiki, altlex, flickr30k_captions, coco_captions, nli_for_simcse and negation datasets. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, paraphrase mining, text classification, clustering, and more.

Read our Static Embeddings blogpost to learn more about this model and how it was trained.

  • 0 Active Parameters: This model has no active parameters. It works exclusively by averaging pre-computed token embeddings.
  • 100x to 400x faster: On CPU, this model is 100x to 400x faster than common options like multilingual-e5-small. On GPU, it's 10x to 25x faster.
  • Matryoshka: This model was trained with a Matryoshka loss, allowing you to truncate the embeddings for faster retrieval at minimal performance costs.
  • Evaluations: See Evaluations for details on performance on NanoBEIR, embedding speed, and Matryoshka dimensionality truncation.
  • Training Script: See train.py for the training script used to train this model from scratch.

See static-retrieval-mrl-en-v1 for an English static embedding model that has been finetuned specifically for retrieval tasks.

Model Details

Model Description

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): StaticEmbedding(
    (embedding): EmbeddingBag(105879, 1024, mode='mean')
  )
)
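
The architecture above is a lookup table followed by a mean. As a rough illustration of that idea (not the library's implementation), the sketch below averages pre-computed vectors for a handful of tokens. The vocabulary, the 4-dimensional vectors, and the embed helper are all made up for this example; the real model looks up rows in a 105879 x 1024 table.

```python
# Toy lookup table: token -> pre-computed vector (hypothetical values).
TOY_EMBEDDINGS = {
    "dry": [1.0, 0.0, 2.0, 0.0],
    "red": [0.0, 1.0, 0.0, 2.0],
    "chili": [2.0, 2.0, 1.0, 1.0],
}


def embed(tokens):
    """Average the pre-computed vectors of the known tokens."""
    vectors = [TOY_EMBEDDINGS[t] for t in tokens if t in TOY_EMBEDDINGS]
    if not vectors:
        raise ValueError("no known tokens to embed")
    dim = len(vectors[0])
    # Mean over tokens, one output value per dimension.
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


sentence_vector = embed(["dry", "red", "chili"])
print(sentence_vector)
```

Because nothing is computed beyond the lookup and the mean, the cost per sentence grows only with the number of tokens.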

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("tomaarsen/static-similarity-mrl-multilingual-v1")
# Run inference
sentences = [
 'It is known for its dry red chili powder .',
 'It is popular for dry red chili powder .',
 'These monsters will move in large groups .',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1024)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
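
To make the similarity step more concrete, here is a minimal pure-Python sketch of a pairwise similarity matrix. It assumes cosine similarity, which we believe is the library default but which this card does not state explicitly. The toy vectors and helper names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(embeddings):
    """Compare every embedding with every other embedding."""
    return [[cosine_similarity(a, b) for b in embeddings] for a in embeddings]


# Three toy 2-dimensional "embeddings" (hypothetical values).
toy_embeddings = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
matrix = similarity_matrix(toy_embeddings)
for row in matrix:
    print([round(value, 4) for value in row])
```

For three input sentences this yields a 3 x 3 matrix whose diagonal is 1, matching the shape printed above.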

This model was trained with Matryoshka loss, so its embeddings can be truncated to lower dimensionalities with minimal performance loss. A lower dimensionality makes downstream tasks such as clustering or classification much faster. You can specify a lower dimensionality with the truncate_dim argument when initializing the SentenceTransformer model:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("tomaarsen/static-similarity-mrl-multilingual-v1", truncate_dim=256)
embeddings = model.encode([
 "I used to hate him.",
 "Раньше я ненавидел его."
])
print(embeddings.shape)
# => (2, 256)
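
Conceptually, Matryoshka truncation keeps only the leading values of each embedding. The sketch below shows that step by hand on a toy vector and then rescales the result to unit length so that a dot product equals cosine similarity. The rescaling is a choice made for this example; the card does not say whether the library normalizes after truncation. The helper name and values are hypothetical.

```python
import math


def truncate_and_normalize(vector, dim):
    """Keep the first `dim` values, then rescale to unit length."""
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale for an all-zero vector
    return [x / norm for x in head]


full_embedding = [3.0, 4.0, 12.0, 0.5]  # toy 4-dimensional embedding
short_embedding = truncate_and_normalize(full_embedding, 2)
print(short_embedding)
```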

Evaluation

We evaluated the model on 5 languages that are well covered by MTEB benchmarks across various tasks.

We want to reiterate that this model is not intended for retrieval use cases. Instead, we evaluate on Semantic Textual Similarity (STS), Classification, and Pair Classification. We compare against the excellent and small multilingual-e5-small model.

Across all measured languages, static-similarity-mrl-multilingual-v1 reaches on average 92.3% of the performance of multilingual-e5-small on STS, 95.52% on Pair Classification, and 86.52% on Classification.

To make up for this performance reduction, static-similarity-mrl-multilingual-v1 is approximately 125x faster on CPU and 10x faster on GPU than multilingual-e5-small. Because attention models scale super-linearly with input length while static embedding models scale linearly, the speedup grows as the number of tokens to encode increases.
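
The scaling argument can be illustrated with a toy cost model. The constants below are arbitrary and are not measurements of either model; the only point is that a quadratic term divided by a linear term gives a ratio that keeps growing with the token count.

```python
def attention_cost(n_tokens, quadratic=1.0, linear=50.0):
    """Hypothetical cost: a pairwise (n^2) term plus a per-token term."""
    return quadratic * n_tokens ** 2 + linear * n_tokens


def static_cost(n_tokens, per_token=1.0):
    """Hypothetical cost: one lookup and one addition per token."""
    return per_token * n_tokens


def speedup(n_tokens):
    """Ratio of the two toy costs for a given input length."""
    return attention_cost(n_tokens) / static_cost(n_tokens)


for n in (16, 128, 512):
    print(n, speedup(n))
```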

Matryoshka Evaluation

Lastly, we measured how English STS performance on MTEB changes under Matryoshka-style dimensionality reduction, that is, when the output embeddings are truncated to a lower dimensionality.

[Figure: English STS MTEB performance vs Matryoshka dimensionality reduction]

As you can see, you can reduce the dimensionality by 2x or 4x with only minor performance hits (0.15% and 0.56%, respectively). If the speed of your downstream task or your storage costs are a bottleneck, this should allow you to alleviate some of those concerns.
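
As a back-of-the-envelope check on the storage side, the sketch below computes the raw size of an embedding index at the full, 2x-reduced, and 4x-reduced dimensionalities. The corpus size of one million vectors and the 4 bytes per value (float32) are assumptions chosen for illustration, and index overhead is ignored.

```python
def index_size_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw storage for `num_vectors` dense vectors of `dim` values each."""
    return num_vectors * dim * bytes_per_value


NUM_VECTORS = 1_000_000  # hypothetical corpus size

for dim in (1024, 512, 256):
    mib = index_size_bytes(NUM_VECTORS, dim) / 2**20
    print(f"{dim:>4} dims: {mib:,.2f} MiB")
```

Storage scales linearly with the dimensionality, so truncating by 4x cuts the raw index size to a quarter.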

Training Details

Training Datasets

Evaluation Datasets

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 2048
  • per_device_eval_batch_size: 2048
  • learning_rate: 0.2
  • num_train_epochs: 1
  • warmup_ratio: 0.1
  • bf16: True
  • batch_sampler: no_duplicates
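
To see what these settings imply, here is a small arithmetic sketch. It takes the total of 30560 steps from the training logs and assumes a linear warmup followed by a linear decay to zero. The scheduler type is not listed above, so that shape is an assumption, as is the exact rounding of the warmup step count. The lr_at helper is hypothetical.

```python
TOTAL_STEPS = 30560   # last step reported in the training logs (epoch 1.0)
BATCH_SIZE = 2048     # per_device_train_batch_size, single GPU
PEAK_LR = 0.2         # learning_rate
WARMUP_RATIO = 0.1    # warmup_ratio

# Roughly the first tenth of training is spent warming up.
warmup_steps = round(TOTAL_STEPS * WARMUP_RATIO)

# Upper bound on samples seen in the epoch (assumes every batch is full).
max_samples_seen = TOTAL_STEPS * BATCH_SIZE


def lr_at(step):
    """Learning rate under an assumed linear warmup and linear decay."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    return PEAK_LR * (TOTAL_STEPS - step) / (TOTAL_STEPS - warmup_steps)


print(warmup_steps, max_samples_seen)
print(lr_at(warmup_steps // 2), lr_at(warmup_steps), lr_at(TOTAL_STEPS))
```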

All Hyperparameters

Training Logs

Epoch Step Training_Loss wikititles_loss tatoeba_loss talks_loss europarl_loss global_voices_loss muse_loss wikimatrix_loss opensubtitles_loss stackexchange_loss quora_loss wikianswers_duplicates_loss all_nli_loss simple_wiki_loss altlex_loss flickr30k_captions_loss coco_captions_loss nli_for_simcse_loss negation_loss
0.0000 1 38.504 - - - - - - - - - - - - - - - - - -
0.0327 1000 21.3661 15.2607 9.1892 11.6736 1.6431 6.6894 31.9579 3.0122 0.3541 5.1814 2.3756 4.9474 12.7699 0.5687 0.8911 21.0068 17.1302 10.8964 6.7603
0.0654 2000 9.8377 11.7637 7.1680 8.7697 1.6077 5.2310 27.4887 1.8375 0.3379 5.1107 2.2083 4.1690 12.0384 0.4837 0.7131 20.5401 17.8388 10.6706 7.0488
0.0982 3000 8.5279 10.8719 6.6160 8.3116 1.5638 4.7298 25.8572 1.6738 0.3152 5.1009 2.0893 3.7332 12.0452 0.4285 0.6519 20.2154 16.2715 10.7693 7.3144
0.1309 4000 7.8208 10.4614 5.4918 7.4421 1.4420 4.0505 24.9000 1.3462 0.2925 4.7643 2.1143 3.7457 11.6570 0.4390 0.6536 19.4405 16.0912 10.7537 7.2120
0.1636 5000 7.5347 9.5381 5.9489 7.4027 1.4858 4.0272 23.8335 1.2453 0.3027 3.1262 1.9170 3.7535 11.6186 0.4090 0.6131 18.9329 16.1769 10.1123 7.0750
0.1963 6000 7.1819 9.2175 5.3231 7.0836 1.4795 3.8328 23.1620 1.1609 0.2964 2.7653 1.9440 3.6610 11.2147 0.3714 0.5853 19.0478 16.4413 9.5790 6.8695
0.2291 7000 6.9852 9.0344 5.5773 6.7928 1.4409 3.9232 23.2098 1.1750 0.2877 2.9254 1.9411 3.5469 11.0744 0.4254 0.6293 19.0447 16.3774 9.5363 6.8393
0.2618 8000 6.8114 8.9620 5.1417 6.5466 1.4834 3.7100 22.9815 1.0679 0.2942 2.7687 2.0211 3.6063 11.3424 0.4447 0.6223 19.1836 16.5669 9.8785 6.8528
0.2945 9000 6.5487 8.6320 4.8710 6.5144 1.4156 3.5712 22.9660 1.0261 0.3051 3.0898 1.9981 3.4305 11.1448 0.3729 0.5814 18.8865 15.8581 9.5213 6.7567
0.3272 10000 6.7398 8.5630 4.7179 6.5025 1.3931 3.5699 22.5319 0.9916 0.2870 3.3385 1.9580 3.5807 11.2592 0.4155 0.6009 19.1387 16.6836 9.6300 6.6613
0.3599 11000 6.3915 8.4041 4.8985 6.2787 1.4081 3.5082 22.3204 0.9554 0.2916 2.9365 2.0176 3.3900 11.2956 0.3902 0.5783 18.6448 16.1241 9.5388 6.7295
0.3927 12000 6.5902 8.1888 4.7326 6.1930 1.4550 3.4999 22.1070 0.9736 0.2935 2.9612 1.9449 3.3281 11.0477 0.3821 0.5696 18.3227 16.1848 9.4772 7.0029
0.4254 13000 6.341 8.1827 4.3838 6.1052 1.4165 3.3944 21.9552 0.9076 0.2991 3.2272 1.9822 3.3494 11.1891 0.3790 0.5600 18.4394 15.9000 9.5644 6.9056
0.4581 14000 6.2067 8.1549 4.4833 6.0765 1.4055 3.3903 21.4785 0.8962 0.2919 2.8893 1.9540 3.3078 11.2100 0.3569 0.5461 18.7667 16.2978 9.2310 7.1290
0.4908 15000 6.2237 8.0711 4.4755 6.0087 1.3185 3.2888 21.3689 0.8433 0.2861 3.0129 1.9084 3.3279 11.1236 0.3730 0.5553 18.2711 15.7648 9.5295 7.0092
0.5236 16000 6.1058 8.0282 4.5076 5.8760 1.4234 3.3046 21.3568 0.8298 0.2826 2.8404 1.8920 3.2918 11.1140 0.3811 0.5550 18.2899 15.8630 9.4807 6.7585
0.5563 17000 6.3038 7.8679 4.4780 5.8461 1.4016 3.2279 21.0624 0.8205 0.2804 3.1359 1.9066 3.3205 11.0882 0.3913 0.5569 18.0693 15.7346 9.2854 6.9239
0.5890 18000 5.9824 7.7827 4.3199 5.7441 1.3582 3.1982 21.2444 0.8046 0.2797 2.7466 1.8717 3.3112 11.0553 0.3922 0.5568 18.0357 15.6732 9.6404 6.8331
0.6217 19000 6.0275 7.7201 4.3591 5.8132 1.3466 3.1888 20.9311 0.8019 0.2765 2.7674 1.8670 3.3082 10.9725 0.3996 0.5560 18.6346 16.2965 9.3774 6.9957
0.6545 20000 6.1161 7.6429 4.2702 5.7298 1.3670 3.1433 20.8899 0.7871 0.2761 2.7486 1.9230 3.2958 11.0207 0.3516 0.5361 18.2297 15.6363 9.6376 7.1608
0.6872 21000 5.9608 7.5852 4.2419 5.7760 1.3838 3.1878 20.9966 0.7837 0.2761 2.7098 1.8715 3.2293 10.8935 0.3514 0.5307 18.1424 15.5101 9.5346 7.0668
0.7199 22000 5.7594 7.5562 4.1123 5.6151 1.3605 3.0954 21.0032 0.7640 0.2769 2.6019 1.8378 3.2377 11.0744 0.3676 0.5431 18.2222 15.7103 9.8826 7.2662
0.7526 23000 5.7118 7.4714 4.0531 5.5998 1.3546 3.0778 20.8820 0.7518 0.2800 2.7544 1.8756 3.2316 10.9986 0.3571 0.5334 18.4476 15.7161 9.6617 7.3730
0.7853 24000 5.8024 7.4414 4.0829 5.6335 1.3383 3.0710 20.8217 0.7487 0.2713 2.6091 1.8695 3.2365 10.9929 0.3419 0.5213 18.4064 15.7831 9.7747 7.4290
0.8181 25000 5.8608 7.4348 4.0571 5.5651 1.3294 3.0518 20.6831 0.7393 0.2784 2.6330 1.8293 3.2197 10.9416 0.3484 0.5213 18.6359 15.8463 9.6883 7.4697
0.8508 26000 5.742 7.4188 3.9483 5.4911 1.3288 3.0402 20.7187 0.7376 0.2772 2.6812 1.8540 3.2415 10.9619 0.3560 0.5323 18.6388 15.7688 9.6707 7.3793
0.8835 27000 5.7429 7.3956 3.9016 5.4393 1.3277 3.0129 20.6748 0.7314 0.2820 2.6526 1.8798 3.1869 10.8744 0.3435 0.5228 18.5191 15.7264 9.5707 7.4266
0.9162 28000 5.7825 7.3748 3.9100 5.4261 1.3420 3.0142 20.6013 0.7263 0.2764 2.6708 1.8529 3.1748 10.8951 0.3491 0.5257 18.4914 15.5663 9.6552 7.2807
0.9490 29000 5.5179 7.3555 3.9046 5.3902 1.3283 2.9882 20.5828 0.7169 0.2732 2.6742 1.8457 3.1760 10.9126 0.3494 0.5246 18.5619 15.6746 9.6539 7.3694
0.9817 30000 5.4044 7.3390 3.8742 5.3713 1.3127 2.9796 20.5703 0.7120 0.2669 2.5612 1.8536 3.1602 10.9068 0.3464 0.5229 18.5389 15.6788 9.5690 7.4148
1.0000 30560 - 7.3346 3.8728 5.3680 1.3066 2.9780 20.5635 0.7107 0.2672 2.5046 1.8514 3.1596 10.9153 0.3467 0.5233 18.5525 15.6815 9.5687 7.4302

Environmental Impact

Carbon emissions were measured using CodeCarbon.

  • Energy Consumed: 0.506 kWh
  • Carbon Emitted: 0.197 kg of CO2
  • Hours Used: 3.163 hours
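
Two derived figures follow directly from the numbers above: the average power draw during training and the implied carbon intensity of the electricity. The sketch below only divides the reported values; the variable names are ours, and the results are averages rather than additional measurements.

```python
ENERGY_KWH = 0.506   # Energy Consumed
CARBON_KG = 0.197    # Carbon Emitted
HOURS = 3.163        # Hours Used

# Average power over the run, in watts.
average_power_watts = ENERGY_KWH / HOURS * 1000

# Implied carbon intensity, in kg of CO2 per kWh.
carbon_intensity = CARBON_KG / ENERGY_KWH

print(f"average power: {average_power_watts:.0f} W")
print(f"carbon intensity: {carbon_intensity:.3f} kg CO2/kWh")
```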

Training Hardware

  • On Cloud: No
  • GPU Model: 1 x NVIDIA GeForce RTX 3090
  • CPU Model: 13th Gen Intel(R) Core(TM) i7-13700K
  • RAM Size: 31.78 GB

Framework Versions

  • Python: 3.11.6
  • Sentence Transformers: 3.3.0.dev0
  • Transformers: 4.45.2
  • PyTorch: 2.5.0+cu121
  • Accelerate: 1.0.0
  • Datasets: 2.20.0
  • Tokenizers: 0.20.1-dev.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
 title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
 author = "Reimers, Nils and Gurevych, Iryna",
 booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
 month = "11",
 year = "2019",
 publisher = "Association for Computational Linguistics",
 url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
 title={Matryoshka Representation Learning},
 author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
 year={2024},
 eprint={2205.13147},
 archivePrefix={arXiv},
 primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
 title={Efficient Natural Language Response Suggestion for Smart Reply},
 author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
 year={2017},
 eprint={1705.00652},
 archivePrefix={arXiv},
 primaryClass={cs.CL}
}