URL: https://huggingface.co/adeshkin/labse-kjh-ru-mnrl-1


SentenceTransformer based on sentence-transformers/LaBSE

This is a sentence-transformers model fine-tuned from sentence-transformers/LaBSE on the khakas-russian-parallel-corpus dataset. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for retrieval, for example matching Khakas (kjh) sentences to their Russian (ru) translations.

Model Details

Model Description

Model Sources

Full Model Architecture

SentenceTransformer(
 (0): Transformer({'transformer_task': 'feature-extraction', 'modality_config': {'text': {'method': 'forward', 'method_output_name': 'last_hidden_state'}}, 'module_output_name': 'token_embeddings', 'architecture': 'BertModel'})
 (1): Pooling({'embedding_dimension': 768, 'pooling_mode': 'cls', 'include_prompt': True})
 (2): Dense({'in_features': 768, 'out_features': 768, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh', 'module_input_name': 'sentence_embedding', 'module_output_name': 'sentence_embedding'})
 (3): Normalize({})
)
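
The three modules after the transformer can be sketched in a few lines of NumPy. This is only an illustration of the data flow (CLS pooling, a 768-to-768 dense layer with Tanh, then L2 normalization); the function name `labse_head` and the random weights are placeholders, not the model's trained parameters.

```python
import numpy as np


def labse_head(token_embeddings, dense_weight, dense_bias):
    """Apply CLS pooling, a Tanh dense layer, and L2 normalization.

    token_embeddings: (batch, seq_len, 768) array, the transformer output.
    dense_weight: (768, 768) and dense_bias: (768,) are placeholder parameters.
    """
    # Module (1): pooling_mode='cls' keeps only the first token's vector.
    cls = token_embeddings[:, 0, :]
    # Module (2): linear layer with bias followed by Tanh.
    dense = np.tanh(cls @ dense_weight.T + dense_bias)
    # Module (3): scale every row to unit L2 norm.
    return dense / np.linalg.norm(dense, axis=1, keepdims=True)


rng = np.random.default_rng(0)
tokens = rng.normal(size=(3, 12, 768))            # random stand-in for BertModel output
weight = rng.normal(scale=0.02, size=(768, 768))  # hypothetical weights
bias = np.zeros(768)

embeddings = labse_head(tokens, weight, bias)
print(embeddings.shape)
```

Because the last module normalizes the output, a dot product between two embeddings equals their cosine similarity.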

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("adeshkin/labse-kjh-ru-mnrl-1")
# Run inference
sentences = [
 'Тӧреенде сағыпчатхан чуртас узуны ортымах чуртас узунын санирында тузаланылча.',
 'Ожидаемая продолжительность жизни при рождении используется в качестве средней продолжительности жизни.',
 'Уранча. Торгаях, ты зачем беспокоишь гостя нашими заботами?',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[ 1.0000, 0.7541, -0.0902],
# [ 0.7541, 1.0000, -0.1342],
# [-0.0902, -0.1342, 1.0000]])
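
For retrieval, the similarity scores are typically turned into a ranking. The following is a minimal sketch, assuming unit-length embeddings such as those produced by the Normalize module; the helper `top_k` and the toy 2-dimensional vectors are illustrative stand-ins for real `model.encode(...)` output.

```python
import numpy as np


def top_k(query_embedding, corpus_embeddings, k=2):
    """Return (index, score) pairs for the k corpus rows most similar to the query."""
    # For unit-length vectors the dot product equals the cosine similarity.
    scores = corpus_embeddings @ query_embedding
    order = np.argsort(-scores)[:k]  # highest score first
    return [(int(i), float(scores[i])) for i in order]


# Toy unit vectors standing in for encoded Russian sentences and a Khakas query.
corpus = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])
query = np.array([0.8, 0.6])

hits = top_k(query, corpus, k=2)
print(hits)
```

With real data, `corpus` would be the encoded candidate sentences and `query` the encoded sentence to look up.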

Evaluation

Metrics

Translation

Metric            Value
src2trg_accuracy  0.9548
trg2src_accuracy  0.951
mean_accuracy     0.9529
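
The metric names suggest a translation-matching evaluation: a source sentence counts as correct when its most similar target is its own translation, and the same check is repeated in the other direction. The sketch below is an assumption about how such scores can be computed, not the evaluator's actual code; `translation_accuracy` and the toy vectors are hypothetical.

```python
import numpy as np


def translation_accuracy(src_emb, trg_emb):
    """Return (src2trg, trg2src, mean) accuracy for row-aligned embedding matrices."""
    sim = src_emb @ trg_emb.T                 # sim[i, j] = similarity of source i and target j
    gold = np.arange(len(src_emb))            # pair i is aligned with pair i
    src2trg = float(np.mean(sim.argmax(axis=1) == gold))
    trg2src = float(np.mean(sim.argmax(axis=0) == gold))
    return src2trg, trg2src, (src2trg + trg2src) / 2


# Toy unit vectors: each target is the closest one to its own source.
src = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
trg = np.array([[0.96, 0.28], [0.28, 0.96], [0.8, 0.6]])
scores = translation_accuracy(src, trg)
print(scores)

# The reported mean is consistent with the two directional values.
reported_mean = (0.9548 + 0.951) / 2
print(round(reported_mean, 4))
```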

Training Details

Training Dataset

khakas-russian-parallel-corpus

  • Dataset: khakas-russian-parallel-corpus at 318e0f5
  • Size: 157,620 training samples
  • Columns: kjh and ru
  • Approximate statistics based on the first 1000 samples:
    column  type    min       mean          max
    kjh     string  6 tokens  28.24 tokens  111 tokens
    ru      string  4 tokens  20.11 tokens  84 tokens
  • Samples:
    • kjh: Тӧреенде сағыпчатхан чуртас узуны ортымах чуртас узунын санирында тузаланылча.
      ru: Ожидаемая продолжительность жизни при рождении используется в качестве средней продолжительности жизни.
    • kjh: ТЕЛЕФОН НОМЕРІ (пар полза) 11.
      ru: НОМЕР ТЕЛЕФОНА (если имеется) 11.
    • kjh: Танығлар піріктірілген полза, ол кӧзідіг прай тоғынчатхан танығларның хыраның узунының суммазынаң пиріл парған.
      ru: Если признаки составные, данный показатель представлен суммой длины поля всех задействованных признаков.
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
     "scale": 20.0,
     "similarity_fct": "cos_sim",
     "gather_across_devices": false,
     "directions": [
     "query_to_doc"
     ],
     "partition_mode": "joint",
     "hardness_mode": null,
     "hardness_strength": 0.0
    }
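
As a rough illustration of what these parameters control, the sketch below computes an in-batch ranking loss in NumPy: cosine similarities between every anchor and every positive are multiplied by `scale`, and each anchor is trained to pick its own positive over the other positives in the batch. It covers only `scale`, `cos_sim`, and the single `query_to_doc` direction; the remaining options are not modeled, and `mnrl_loss` is a hypothetical name rather than the library implementation.

```python
import numpy as np


def mnrl_loss(anchor_emb, positive_emb, scale=20.0):
    """Cross-entropy over scaled cosine similarities with in-batch negatives."""
    a = anchor_emb / np.linalg.norm(anchor_emb, axis=1, keepdims=True)
    p = positive_emb / np.linalg.norm(positive_emb, axis=1, keepdims=True)
    logits = scale * (a @ p.T)                            # (batch, batch) scaled cos_sim
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct "class" for anchor i is positive i, i.e. the diagonal.
    return float(-np.mean(np.diag(log_probs)))


pairs = np.eye(3)                                    # three orthogonal toy embeddings
loss_aligned = mnrl_loss(pairs, pairs)               # each anchor matches its positive
loss_shuffled = mnrl_loss(pairs, pairs[[1, 2, 0]])   # positives assigned to the wrong anchors
print(loss_aligned, loss_shuffled)
```

A larger `scale` sharpens the softmax, so well-separated pairs reach a loss near zero while mismatched pairs are penalized more heavily.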
    

Evaluation Dataset

khakas-russian-parallel-corpus

  • Dataset: khakas-russian-parallel-corpus at 318e0f5
  • Size: 1,593 evaluation samples
  • Columns: kjh and ru
  • Approximate statistics based on the first 1000 samples:
    column  type    min       mean          max
    kjh     string  7 tokens  28.36 tokens  146 tokens
    ru      string  5 tokens  20.47 tokens  97 tokens
  • Samples:
    • kjh: Чуртас тооза пос чирінің омазын чӱреенде ал чӧрген, хайда даа полза, Хакас чирінеңер кӧглеен.
      ru: Всю жизнь носил в сердце образ своей земли, где бы ни был, воспевал Хакасскую землю.
    • kjh: 2 Пил палых сурча
      ru: 2 Таймень-рыба спрашивает:
    • kjh: чолға сығар тим чит килген
      ru: наступила пора собираться в путь
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
     "scale": 20.0,
     "similarity_fct": "cos_sim",
     "gather_across_devices": false,
     "directions": [
     "query_to_doc"
     ],
     "partition_mode": "joint",
     "hardness_mode": null,
     "hardness_strength": 0.0
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • learning_rate: 2e-05
  • num_train_epochs: 1
  • warmup_steps: 1000
  • fp16: True
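
To make the interaction of `learning_rate` and `warmup_steps` concrete, here is a small sketch of a linear warmup followed by linear decay. Two things are assumptions rather than facts from this card: the linear scheduler (the usual trainer default, since no scheduler is listed among the non-default values) and `total_steps=19703`, which is inferred from the epoch and step columns of the training logs.

```python
def learning_rate(step, peak_lr=2e-05, warmup_steps=1000, total_steps=19703):
    """Learning rate at a given optimizer step (assumed linear warmup and decay)."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to peak_lr during warmup.
        return peak_lr * step / warmup_steps
    # Decay linearly from peak_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


for step in (0, 500, 1000, 10000, 19703):
    print(step, learning_rate(step))
```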

All Hyperparameters

Training Logs

Epoch   Step   Training Loss  Validation Loss  kjh-ru-random_mean_accuracy
0.0254  500    0.1515         0.0636           0.7759
0.0508  1000   0.0661         0.0503           0.8267
0.0761  1500   0.0460         0.0285           0.8534
0.1015  2000   0.0402         0.0271           0.8628
0.1269  2500   0.0328         0.0240           0.8741
0.1523  3000   0.0253         0.0293           0.8851
0.1776  3500   0.0247         0.0231           0.8930
0.2030  4000   0.0285         0.0157           0.9090
0.2284  4500   0.0216         0.0153           0.9002
0.2538  5000   0.0172         0.0142           0.9171
0.2791  5500   0.0215         0.0170           0.8983
0.3045  6000   0.0172         0.0138           0.9187
0.3299  6500   0.0109         0.0162           0.9175
0.3553  7000   0.0146         0.0115           0.9253
0.3807  7500   0.0144         0.0149           0.9278
0.4060  8000   0.0116         0.0101           0.9347
0.4314  8500   0.0119         0.0142           0.9369
0.4568  9000   0.0196         0.0127           0.9382
0.4822  9500   0.0090         0.0120           0.9372
0.5075  10000  0.0123         0.0129           0.9438
0.5329  10500  0.0113         0.0086           0.9397
0.5583  11000  0.0091         0.0113           0.9435
0.5837  11500  0.0118         0.0104           0.9419
0.6090  12000  0.0076         0.0099           0.9429
0.6344  12500  0.0115         0.0081           0.9401
0.6598  13000  0.0074         0.0095           0.9466
0.6852  13500  0.0116         0.0090           0.9466
0.7106  14000  0.0107         0.0082           0.9520
0.7359  14500  0.0125         0.0068           0.9498
0.7613  15000  0.0109         0.0092           0.9507
0.7867  15500  0.0064         0.0069           0.9529
0.8121  16000  0.0077         0.0079           0.9532
0.8374  16500  0.0063         0.0067           0.9539
0.8628  17000  0.0072         0.0057           0.9539
0.8882  17500  0.0075         0.0060           0.9545
0.9136  18000  0.0098         0.0061           0.9539
0.9389  18500  0.0057         0.0058           0.9539
0.9643  19000  0.0079         0.0059           0.9526
0.9897  19500  0.0052         0.0058           0.9529
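
The Epoch column can be related to the Step column with simple arithmetic. The sketch below assumes a batch size of 8 on a single device; that value is not stated in this card and is used here only because it reproduces the logged epoch fractions for the 157,620 training samples.

```python
import math

train_samples = 157_620   # training set size reported in the card
assumed_batch_size = 8    # ASSUMPTION: not stated in the card
steps_per_epoch = math.ceil(train_samples / assumed_batch_size)


def epoch_at(step):
    """Fraction of the epoch completed at a given step, rounded like the log."""
    return round(step / steps_per_epoch, 4)


print(steps_per_epoch)
print(epoch_at(500), epoch_at(10000), epoch_at(19500))
```

Under this assumption the single epoch ends shortly after the last logged step, which fits the final logged epoch value of 0.9897.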

Training Time

  • Training: 1.4 hours

Framework Versions

  • Python: 3.12.13
  • Sentence Transformers: 5.4.1
  • Transformers: 5.0.0
  • PyTorch: 2.10.0+cu128
  • Accelerate: 1.13.0
  • Datasets: 4.0.0
  • Tokenizers: 0.22.2

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
 title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
 author = "Reimers, Nils and Gurevych, Iryna",
 booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
 month = "11",
 year = "2019",
 publisher = "Association for Computational Linguistics",
 url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{oord2019representationlearningcontrastivepredictive,
 title={Representation Learning with Contrastive Predictive Coding},
 author={Aaron van den Oord and Yazhe Li and Oriol Vinyals},
 year={2019},
 eprint={1807.03748},
 archivePrefix={arXiv},
 primaryClass={cs.LG},
 url={https://arxiv.org/abs/1807.03748},
}

Model size: 0.5B params (Safetensors, tensor type F32)
