nllb-200-distilled-600M-kjh-ru

A Khakas ↔ Russian machine translation model created by fine-tuning facebook/nllb-200-distilled-600M on a Khakas–Russian parallel corpus.

Khakas (Хакас тілі) is a low-resource Turkic language spoken in the Republic of Khakassia, Russia.

Quick Start

Requirements: transformers, torch

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

model_name = "adeshkin/nllb-200-distilled-600M-kjh-ru"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# --- Khakas → Russian ---
src_lang = "kjh_Cyrl" # Khakas
tgt_lang = "rus_Cyrl" # Russian
text = '54. "Ат ӱгредерде арғамҷың пик ползын, чонға чоохтирда чооғың сын ползын" сӧспектің тузазын чарыда пас пиріңер.'

tokenizer.src_lang = src_lang
tokenizer.tgt_lang = tgt_lang
inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=1024)

model.eval()
with torch.no_grad():
    outputs = model.generate(
        **inputs.to(model.device),
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_new_tokens=int(32 + 3 * inputs.input_ids.shape[1]),
        num_beams=4,
    )

# Decode the single returned hypothesis into one string.
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
# 54. Объясните значение пословицы "При обучении коня пусть будет крепкий верёв твой, при обращении к народу пусть будет истинно слово твоё."

To translate in the opposite direction, set src_lang = "rus_Cyrl" and tgt_lang = "kjh_Cyrl".
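Both directions can be wrapped in one helper. The following is a minimal sketch rather than part of the released code: the names lang_pair, max_new_tokens_for, and translate are hypothetical, and the generation settings simply mirror the Quick Start snippet.

```python
KJH, RUS = "kjh_Cyrl", "rus_Cyrl"


def lang_pair(direction):
    """Map 'kjh-ru' or 'ru-kjh' to (src_lang, tgt_lang) NLLB codes."""
    pairs = {"kjh-ru": (KJH, RUS), "ru-kjh": (RUS, KJH)}
    if direction not in pairs:
        raise ValueError(f"unknown direction: {direction!r}")
    return pairs[direction]


def max_new_tokens_for(input_len):
    """Output length budget used in the Quick Start: 32 + 3 * input tokens."""
    return 32 + 3 * input_len


def translate(text, direction, model, tokenizer, num_beams=4):
    """Translate one string; model and tokenizer are loaded as in the Quick Start."""
    import torch  # imported lazily so the pure helpers above work without torch

    src_lang, tgt_lang = lang_pair(direction)
    tokenizer.src_lang = src_lang
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    model.eval()
    with torch.no_grad():
        outputs = model.generate(
            **inputs.to(model.device),
            forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
            max_new_tokens=max_new_tokens_for(inputs.input_ids.shape[1]),
            num_beams=num_beams,
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

With the model and tokenizer from the Quick Start in scope, translate("Привет, мир!", "ru-kjh", model, tokenizer) would run the Russian → Khakas direction.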

Base Model

facebook/nllb-200-distilled-600M is an encoder-decoder machine translation model from the NLLB (No Language Left Behind) family that supports translation among 200 languages. For this model, the original NLLB tokenizer and embedding layer were extended to support Khakas (kjh_Cyrl), which was initialized from Kazakh (kaz_Cyrl) as a structurally similar language.
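One plausible reading of "initialized using Kazakh" is that the embedding row of the new kjh_Cyrl language tag starts as a copy of the kaz_Cyrl row. The sketch below illustrates that reading only. The function name init_language_token is hypothetical, and the code assumes the tokenizer already contains kjh_Cyrl and that model.resize_token_embeddings(len(tokenizer)) has been called.

```python
def init_language_token(model, tokenizer, new_lang="kjh_Cyrl", donor_lang="kaz_Cyrl"):
    """Copy the donor language tag's embedding row into the new tag's row.

    Assumption: the model follows the M2M100/NLLB layout, where the token
    embedding matrix lives at model.model.shared.
    """
    new_id = tokenizer.convert_tokens_to_ids(new_lang)
    donor_id = tokenizer.convert_tokens_to_ids(donor_lang)
    weights = model.model.shared.weight.data
    # For torch tensors this assignment copies the donor values in place.
    weights[new_id] = weights[donor_id]
    return new_id, donor_id
```

Starting from a related language's tag gives the new tag a sensible initial representation instead of a random one. Whether the author's script does exactly this is not stated in the card.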

Fine-Tuning

The model was fully fine-tuned on a mixture of datasets containing Khakas–Russian sentence pairs. The overall training and tokenizer extension approach is based on the guide "How to fine-tune an NLLB-200 model for translating a new language".

Tokenizer Update

Because the original NLLB tokenizer marked some Khakas characters as <unk> (unknown), the vocabulary was explicitly extended. The update_tokenizer.py script:

Trains a new SentencePiece model on the adeshkin/kjh-mono-sents dataset to identify missing tokens.
Modifies the underlying sentencepiece_model_pb2 protobuf of the NLLB tokenizer to append these new Khakas tokens.
Updates the overall vocabulary size of the tokenizer and the model embeddings.
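The merging step can be illustrated without SentencePiece itself. The sketch below is a simplified stand-in for update_tokenizer.py, not the script: it represents each vocabulary as a list of (piece, score) tuples, and the name extend_pieces and the score-shifting rule are assumptions made for illustration.

```python
def extend_pieces(existing, candidates):
    """Append candidate pieces that the existing vocabulary lacks.

    existing, candidates: lists of (piece, score) tuples, where scores are
    SentencePiece-style log-probabilities (zero or negative).
    Returns the extended list and the number of pieces added.
    """
    known = {piece for piece, _ in existing}
    floor = min(score for _, score in existing)  # lowest existing score
    extended = list(existing)
    for piece, score in candidates:
        if piece in known:
            continue  # already covered by the original tokenizer
        # Assumption: shift new scores by the existing minimum so new pieces
        # never outrank the original ones during segmentation.
        extended.append((piece, score + floor))
        known.add(piece)
    return extended, len(extended) - len(existing)


existing = [("▁а", -1.0), ("▁пол", -3.5)]
candidates = [("▁пол", -2.0), ("ғ", -4.0), ("ӱ", -4.5)]
extended, added = extend_pieces(existing, candidates)
print(added, len(extended))  # 2 4
```

The new length (here 4) is the value the model's embedding matrix would then be resized to, for example with model.resize_token_embeddings(len(tokenizer)) in transformers.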

Note: Running update_tokenizer.py requires exactly transformers==4.57.3 because of how it modifies tokenizer internals. The resulting model and tokenizer, however, can be used with later versions of transformers for both training and inference.

Training Hyperparameters

Max sequence length: 128
Batch size: 16 (per device) with 2 gradient accumulation steps
Learning rate: 1e-4
Optimizer: Adafactor
LR scheduler: Cosine with warmup
Warmup steps: 1,000
Max steps: 200,000
Precision: fp16 (autocast)
Hardware: 1x NVIDIA Tesla T4 (Google Colab)
Training time: ~10 hours

For full training details and scripts, see the khakas-mt repository.
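To make the schedule concrete, the sketch below computes the learning rate at a given step from the values listed above. It assumes linear warmup followed by a single half-cosine decay to zero, and that "Max steps" counts optimizer steps. Neither detail is confirmed by this card, and the name lr_at is hypothetical.

```python
import math


def lr_at(step, base_lr=1e-4, warmup_steps=1_000, max_steps=200_000):
    """Learning rate at an optimizer step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


# Sentence pairs per optimizer step on the single GPU listed above:
# per-device batch size times gradient accumulation steps.
effective_batch = 16 * 2

for step in (0, 500, 1_000, 100_500, 200_000):
    print(step, lr_at(step))
```

Under these assumptions the rate peaks at 1e-4 at step 1,000 and reaches zero at step 200,000, with 32 sentence pairs contributing to each update.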

Training Data

The training corpus consists of ~160k parallel sentence pairs. During training, the sampling ratio between translation directions was dynamically set to 60% (kjh → ru) and 40% (ru → kjh).
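A 60/40 split between directions can be implemented by drawing the direction at random for each training example. This is a minimal sketch of that idea. The card does not say whether the ratio is applied per example or per batch, and the function name make_example is hypothetical.

```python
import random


def make_example(kjh_text, ru_text, rng, p_kjh_to_ru=0.6):
    """Orient one sentence pair as (src_lang, tgt_lang, src_text, tgt_text)."""
    if rng.random() < p_kjh_to_ru:
        return ("kjh_Cyrl", "rus_Cyrl", kjh_text, ru_text)
    return ("rus_Cyrl", "kjh_Cyrl", ru_text, kjh_text)


rng = random.Random(0)  # fixed seed so the draw sequence is reproducible
examples = [make_example("kjh sentence", "ru sentence", rng) for _ in range(10_000)]
share = sum(ex[0] == "kjh_Cyrl" for ex in examples) / len(examples)
print(f"kjh -> ru share: {share:.3f}")
```

Over many draws the share of kjh → ru examples converges to 0.6, while each individual pair can appear in either direction across epochs.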

Source	Pairs	Link
Khakas–Russian Parallel Corpus	159,213	adeshkin/khakas-russian-parallel-corpus
Google SmolSent	863	adeshkin/google-smol-en-ru-kjh (smolsent)
Google SmolDoc	825	adeshkin/google-smol-en-ru-kjh (smoldoc)

Evaluation

Evaluated on the FLORES+ devtest split (1,012 sentence pairs) using SacreBLEU:

Direction	BLEU	chrF++
kjh → ru	24.40	50.12
ru → kjh	19.09	51.10

The FLORES+ dev split (997 sentences) was used for validation during training.
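Scores of this kind can be computed with the sacrebleu Python package. The sketch below shows one way to do it and is not the author's evaluation script. The function name score_translations is hypothetical, and the exact SacreBLEU settings behind the table above (tokenizer, signature) are not given in this card, so the defaults used here are an assumption.

```python
def score_translations(hypotheses, references):
    """Return corpus-level BLEU and chrF++ for parallel lists of strings."""
    if len(hypotheses) != len(references):
        raise ValueError("hypotheses and references must have the same length")

    import sacrebleu  # imported lazily; install with `pip install sacrebleu`

    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    # word_order=2 turns chrF into chrF++ (character n-grams plus word bigrams).
    chrf_pp = sacrebleu.corpus_chrf(hypotheses, [references], word_order=2)
    return {"BLEU": round(bleu.score, 2), "chrF++": round(chrf_pp.score, 2)}
```

Here hypotheses would be the model outputs for the 1,012 devtest source sentences and references the corresponding target-side sentences, in the same order.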

License

This model is distributed under the CC-BY-NC 4.0 License, as the base model is licensed under the same terms.


URL: https://huggingface.co/adeshkin/nllb-200-distilled-600M-kjh-ru
