URL: https://huggingface.co/adeshkin/nllb-200-distilled-600M-kjh-ru


nllb-200-distilled-600M-kjh-ru

A Khakas ↔ Russian machine translation model created by fine-tuning facebook/nllb-200-distilled-600M on a Khakas–Russian parallel corpus.

Khakas (Хакас тілі) is a low-resource Turkic language spoken in the Republic of Khakassia, Russia.

Quick Start

Requirements: transformers, torch

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

model_name = "adeshkin/nllb-200-distilled-600M-kjh-ru"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# --- Khakas → Russian ---
src_lang = "kjh_Cyrl" # Khakas
tgt_lang = "rus_Cyrl" # Russian
text = '54. "Ат ӱгредерде арғамҷың пик ползын, чонға чоохтирда чооғың сын ползын" сӧспектің тузазын чарыда пас пиріңер.'

tokenizer.src_lang = src_lang
tokenizer.tgt_lang = tgt_lang
inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=1024)

model.eval()
with torch.no_grad():
    outputs = model.generate(
        **inputs.to(model.device),
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_new_tokens=int(32 + 3 * inputs.input_ids.shape[1]),
        num_beams=4,
    )

# batch_decode expects the whole batch; take the first (and only) translation
result = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(result)
# 54. Объясните значение пословицы "При обучении коня пусть будет крепкий верёв твой, при обращении к народу пусть будет истинно слово твоё."

To translate in the opposite direction, set src_lang = "rus_Cyrl" and tgt_lang = "kjh_Cyrl".
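
For repeated use, the Quick Start steps can be wrapped in one function. The sketch below is an illustration, not part of the model repository: `translate` is a hypothetical helper name, and it assumes `tokenizer` and `model` were loaded as in the Quick Start.

```python
def translate(texts, tokenizer, model, src_lang="kjh_Cyrl", tgt_lang="rus_Cyrl", num_beams=4):
    """Translate a list of strings (hypothetical helper, not part of the model repo)."""
    tokenizer.src_lang = src_lang  # selects the language tag added to the input
    inputs = tokenizer(
        texts, return_tensors="pt", padding=True, truncation=True, max_length=1024
    ).to(model.device)
    outputs = model.generate(
        **inputs,
        # The target language is chosen by forcing its tag as the first generated token.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        # Same length budget as the Quick Start: 32 tokens plus 3x the padded input length.
        max_new_tokens=int(32 + 3 * inputs.input_ids.shape[1]),
        num_beams=num_beams,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


# Usage, assuming `tokenizer`, `model`, and the sentence lists already exist:
# translate(khakas_sentences, tokenizer, model)                           # Khakas -> Russian
# translate(russian_sentences, tokenizer, model, "rus_Cyrl", "kjh_Cyrl")  # Russian -> Khakas
```

Passing the language codes as arguments keeps both directions on one code path, so switching direction only means swapping the two codes.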

Base Model

facebook/nllb-200-distilled-600M is an encoder-decoder machine translation model from the NLLB (No Language Left Behind) family that supports translation among 200 languages. For this model, the original NLLB tokenizer and embedding layer were extended to support the Khakas language (kjh_Cyrl). The extension was initialized using Kazakh (kaz_Cyrl), a structurally similar language.

Fine-Tuning

The model was fully fine-tuned on a mixture of datasets containing Khakas–Russian sentence pairs. The overall training and tokenizer extension approach is based on the guide "How to fine-tune an NLLB-200 model for translating a new language".

Tokenizer Update

Because the original NLLB tokenizer marked some Khakas characters as <unk> (unknown), the vocabulary was explicitly extended. The update_tokenizer.py script:

  1. Trains a new SentencePiece model on the adeshkin/kjh-mono-sents dataset to identify missing tokens.
  2. Modifies the underlying sentencepiece_model_pb2 protobuf of the NLLB tokenizer to append these new Khakas tokens.
  3. Updates the overall vocabulary size of the tokenizer and the model embeddings.
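
The merge in step 2 can be pictured with a toy example. The sketch below is not update_tokenizer.py: it replaces the SentencePiece protobuf with plain Python objects. The `Piece` class, the `merge_pieces` name, the example pieces, and the choice to shift new scores below the lowest existing score are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Piece:
    piece: str    # the subword string
    score: float  # log-probability-like score; higher means preferred in segmentation


def merge_pieces(old_pieces, new_pieces):
    """Append pieces the old vocabulary lacks; existing pieces keep their positions."""
    known = {p.piece for p in old_pieces}
    # Assumption: offset new scores so they rank below every existing piece.
    offset = min(p.score for p in old_pieces)
    merged = list(old_pieces)
    for p in new_pieces:
        if p.piece not in known:
            merged.append(Piece(p.piece, p.score + offset))
            known.add(p.piece)
    return merged


# Made-up pieces standing in for the original and the newly trained vocabularies.
old = [Piece("▁ат", -8.0), Piece("▁чон", -9.5)]
new = [Piece("▁чон", -3.0), Piece("ғ", -4.0), Piece("ҷ", -5.0)]
merged = merge_pieces(old, new)
print([p.piece for p in merged])  # ['▁ат', '▁чон', 'ғ', 'ҷ']
print(len(merged) - len(old))     # 2
```

Appending at the end leaves the positions of existing pieces untouched, and the count of appended pieces is the number of extra embedding rows the model needs. In transformers, the embedding matrix is typically grown with `model.resize_token_embeddings(len(tokenizer))`.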

Note: Running update_tokenizer.py requires exactly transformers==4.57.3 because of how it modifies tokenizer internals. The resulting model and tokenizer, however, can be used with any later version of transformers for both training and inference.

Training Hyperparameters

  • Max sequence length: 128
  • Batch size: 16 per device with 2 gradient accumulation steps (effective batch size of 32 on the single GPU)
  • Learning rate: 1e-4
  • Optimizer: Adafactor
  • LR scheduler: Cosine with warmup
  • Warmup steps: 1,000
  • Max steps: 200,000
  • Precision: fp16 (autocast)
  • Hardware: 1x NVIDIA Tesla T4 (Google Colab)
  • Training time: ~10 hours
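
To make the schedule concrete, here is a minimal sketch of the learning rate over training. It assumes linear warmup followed by a half-cosine decay to zero at the final step, which is the common shape for this kind of scheduler. The exact scheduler used in training is not specified here, and `lr_at` is a hypothetical name.

```python
import math

PEAK_LR = 1e-4        # learning rate from the list above
WARMUP_STEPS = 1_000
MAX_STEPS = 200_000


def lr_at(step):
    """Linear warmup to PEAK_LR, then half-cosine decay to zero at MAX_STEPS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 500, 1_000, 100_500, 200_000):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the rate reaches its peak of 1e-4 at step 1,000, falls to half of that midway through the decay phase (step 100,500), and reaches zero at step 200,000.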

For full training details and scripts, see the khakas-mt repository.

Training Data

The training corpus consists of ~160k parallel sentence pairs. During training, the sampling ratio between translation directions was dynamically set to 60% (kjh → ru) and 40% (ru → kjh).

| Source | Pairs | Link |
|---|---|---|
| Khakas–Russian Parallel Corpus | 159,213 | adeshkin/khakas-russian-parallel-corpus |
| Google SmolSent | 863 | adeshkin/google-smol-en-ru-kjh (smolsent) |
| Google SmolDoc | 825 | adeshkin/google-smol-en-ru-kjh (smoldoc) |
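
One way to realize a 60/40 split between directions is to pick the direction at random for each training example. The sketch below illustrates that idea only. Per-example sampling is an assumption about how the ratio was applied, and `sample_direction` and `make_example` are hypothetical names.

```python
import random


def sample_direction(rng, p_kjh_to_ru=0.6):
    """Pick (src_lang, tgt_lang) for one training example."""
    if rng.random() < p_kjh_to_ru:
        return ("kjh_Cyrl", "rus_Cyrl")
    return ("rus_Cyrl", "kjh_Cyrl")


def make_example(pair, rng):
    """Orient a (khakas, russian) pair as (src_lang, tgt_lang, src_text, tgt_text)."""
    kjh_text, ru_text = pair
    src_lang, tgt_lang = sample_direction(rng)
    if src_lang == "kjh_Cyrl":
        return src_lang, tgt_lang, kjh_text, ru_text
    return src_lang, tgt_lang, ru_text, kjh_text


rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [sample_direction(rng)[0] for _ in range(10_000)]
print(draws.count("kjh_Cyrl") / len(draws))  # close to 0.6
```

Because each pair can be used in either direction, the same corpus serves both translation tasks, and the ratio only controls how often each direction appears in a batch.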

Evaluation

Evaluated on the FLORES+ devtest split (1,012 sentence pairs) using SacreBLEU:

| Direction | BLEU | chrF++ |
|---|---|---|
| kjh → ru | 24.40 | 50.12 |
| ru → kjh | 19.09 | 51.10 |

FLORES+ dev split (997 sentences) was used for validation during training.

License

This model is distributed under the CC-BY-NC 4.0 License, as the base model is licensed under the same terms.
