nllb-200-distilled-600M-kjh-ru
A Khakas ↔ Russian machine translation model created by fine-tuning facebook/nllb-200-distilled-600M on a Khakas–Russian parallel corpus.
Khakas (Хакас тілі) is a low-resource Turkic language spoken in the Republic of Khakassia, Russia.
Quick Start
Requirements: transformers, torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch
model_name = "adeshkin/nllb-200-distilled-600M-kjh-ru"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
# --- Khakas → Russian ---
src_lang = "kjh_Cyrl" # Khakas
tgt_lang = "rus_Cyrl" # Russian
text = '54. "Ат ӱгредерде арғамҷың пик ползын, чонға чоохтирда чооғың сын ползын" сӧспектің тузазын чарыда пас пиріңер.'
tokenizer.src_lang = src_lang
tokenizer.tgt_lang = tgt_lang
inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=1024)
model.eval()
with torch.no_grad():
    outputs = model.generate(
        **inputs.to(model.device),
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_new_tokens=int(32 + 3 * inputs.input_ids.shape[1]),
        num_beams=4,
    )
# outputs has shape (batch, seq_len); decode the single sequence in the batch
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
# 54. Объясните значение пословицы "При обучении коня пусть будет крепкий верёв твой, при обращении к народу пусть будет истинно слово твоё."
To translate in the opposite direction, set src_lang = "rus_Cyrl" and tgt_lang = "kjh_Cyrl".
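The direction swap can also be wrapped in a small helper. The following is a minimal sketch rather than part of the released code: the names `length_budget` and `translate` are hypothetical, and the generation settings simply mirror the Quick Start snippet. It assumes `model` and `tokenizer` were loaded as shown above.

```python
def length_budget(n_input_tokens: int) -> int:
    """Output-length cap used in the Quick Start: 32 + 3 * input length."""
    return 32 + 3 * n_input_tokens


def translate(text, model, tokenizer, src_lang="kjh_Cyrl", tgt_lang="rus_Cyrl"):
    """Translate one string; swap src_lang/tgt_lang for Russian -> Khakas."""
    import torch  # imported lazily so length_budget is usable without torch

    tokenizer.src_lang = src_lang
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    inputs = inputs.to(model.device)
    model.eval()
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            # Force the decoder to start with the target language tag.
            forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
            max_new_tokens=length_budget(inputs["input_ids"].shape[1]),
            num_beams=4,
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

For Russian → Khakas, the call would look like `translate(text, model, tokenizer, src_lang="rus_Cyrl", tgt_lang="kjh_Cyrl")`.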
Base Model
facebook/nllb-200-distilled-600M is an encoder-decoder machine translation model from the NLLB (No Language Left Behind) family. It supports translation among 200 languages.
For this model, the original NLLB tokenizer and embedding layer were extended to support the Khakas language (kjh_Cyrl), initialized using Kazakh (kaz_Cyrl) as a structurally similar language.
Fine-Tuning
The model was fully fine-tuned on a mixture of datasets containing Khakas–Russian sentence pairs. The overall training and tokenizer extension approach is based on the guide "How to fine-tune an NLLB-200 model for translating a new language".
Tokenizer Update
Because the original NLLB tokenizer mapped some Khakas characters to <unk> (unknown), the vocabulary was explicitly extended. The update_tokenizer.py script:
- Trains a new SentencePiece model on the adeshkin/kjh-mono-sents dataset to identify missing tokens.
- Modifies the underlying sentencepiece_model_pb2 protobuf of the NLLB tokenizer to append these new Khakas tokens.
- Updates the overall vocabulary size of the tokenizer and the model embeddings.
Note: Running update_tokenizer.py requires exactly transformers==4.57.3 due to how tokenizer internals are modified. However, the resulting model and tokenizer can be used with any later version of transformers for both training and inference.
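To illustrate why the vocabulary needed extending, the sketch below counts characters in a text sample that a given character set does not cover. It is a simplified, character-level stand-in and not the logic of update_tokenizer.py, which trains a SentencePiece model instead. The function name `uncovered_chars` and the toy data are hypothetical.

```python
from collections import Counter


def uncovered_chars(sentences, known_chars):
    """Count non-space characters in `sentences` absent from `known_chars`."""
    counts = Counter()
    for sentence in sentences:
        for ch in sentence:
            if not ch.isspace() and ch not in known_chars:
                counts[ch] += 1
    return counts


# Toy stand-ins (assumptions): a real check would read the monolingual corpus
# and the tokenizer's actual vocabulary instead of these hand-written values.
toy_known = set("абвгдежзийклмнопрстуфхцчшщъыьэюя")
toy_sentences = ["ӱгредерде", "чонға", "чооғың"]

missing = uncovered_chars(toy_sentences, toy_known)
print(missing.most_common())
```

Characters that show up in `missing` are the ones a tokenizer limited to `toy_known` could not represent and would therefore map to the unknown token.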
Training Hyperparameters
- Max sequence length: 128
- Batch size: 16 (per device) with 2 gradient accumulation steps
- Learning rate: 1e-4
- Optimizer: Adafactor
- LR scheduler: Cosine with warmup
- Warmup steps: 1,000
- Max steps: 200,000
- Precision: fp16 (autocast)
- Hardware: 1x NVIDIA Tesla T4 (Google Colab)
- Training time: ~10 hours
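The schedule implied by these values can be sketched in plain Python. This assumes the common form of "cosine with warmup" (linear warmup from zero, then a half-cosine decay to zero over the remaining steps); the actual training script may differ in details. The constant and function names are hypothetical.

```python
import math

BASE_LR = 1e-4
WARMUP_STEPS = 1_000
MAX_STEPS = 200_000
PER_DEVICE_BATCH = 16
GRAD_ACCUM_STEPS = 2


def lr_at(step: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then cosine decay."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))


# With a single GPU, each optimizer step sees 16 * 2 sentence pairs.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS

for step in (0, 500, 1_000, 100_500, 200_000):
    print(step, lr_at(step))
```

Under this assumption the learning rate peaks at 1e-4 when warmup ends and falls to half that value midway through the decay phase.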
For full training details and scripts, see the khakas-mt repository.
Training Data
The training corpus consists of ~160k parallel sentence pairs. During training, the sampling ratio between translation directions was dynamically set to 60% (kjh → ru) and 40% (ru → kjh).
| Source | Pairs | Link |
|---|---|---|
| Khakas–Russian Parallel Corpus | 159,213 | adeshkin/khakas-russian-parallel-corpus |
| Google SmolSent | 863 | adeshkin/google-smol-en-ru-kjh (smolsent) |
| Google SmolDoc | 825 | adeshkin/google-smol-en-ru-kjh (smoldoc) |
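The 60/40 direction sampling described above can be illustrated as follows. This is a minimal sketch assuming the direction is drawn independently for each training example; the real data loader may implement the ratio differently. The names `sample_direction` and `make_example` are hypothetical.

```python
import random


def sample_direction(rng: random.Random, p_kjh_to_ru: float = 0.6):
    """Return (src_lang, tgt_lang), choosing kjh -> ru with probability 0.6."""
    if rng.random() < p_kjh_to_ru:
        return "kjh_Cyrl", "rus_Cyrl"
    return "rus_Cyrl", "kjh_Cyrl"


def make_example(kjh_text: str, ru_text: str, rng: random.Random) -> dict:
    """Turn one parallel pair into a directed training example."""
    src_lang, tgt_lang = sample_direction(rng)
    if src_lang == "kjh_Cyrl":
        src_text, tgt_text = kjh_text, ru_text
    else:
        src_text, tgt_text = ru_text, kjh_text
    return {"src_lang": src_lang, "tgt_lang": tgt_lang,
            "src_text": src_text, "tgt_text": tgt_text}


rng = random.Random(0)  # fixed seed so the sketch is reproducible
draws = [sample_direction(rng)[0] for _ in range(10_000)]
share_kjh_to_ru = draws.count("kjh_Cyrl") / len(draws)
print(f"kjh -> ru share: {share_kjh_to_ru:.3f}")
```

Sampling the direction per example means the same sentence pair can be seen in both directions over the course of training.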
Evaluation
Evaluated on the FLORES+ devtest split (1,012 sentence pairs) using SacreBLEU:
| Direction | BLEU | chrF++ |
|---|---|---|
| kjh → ru | 24.40 | 50.12 |
| ru → kjh | 19.09 | 51.10 |
The FLORES+ dev split (997 sentences) was used for validation during training.
License
This model is distributed under the CC-BY-NC 4.0 License, as the base model is licensed under the same terms.