URL: https://huggingface.co/NMikka/F5-TTS-Georgian

F5-TTS Georgian

A fine-tuned version of SWivid/F5-TTS (335M params) for Georgian text-to-speech. With a training speaker as the voice reference, the model produces intelligible, high-quality Georgian speech (mean round-trip CER of 0.0509 on FLEURS Georgian; see Evaluation). Voice cloning for speakers outside the training set is still a work in progress.

Model Details

Base model: SWivid/F5-TTS v1 Base (335M params, DiT + ConvNeXt V2)
Fine-tuning: Full fine-tune (continuation of flow-matching pretraining), no LoRA
Training data: NMikka/Common-Voice-Geo-Cleaned (20,300 samples, 12 speakers)
Training: 110,000 updates (~100 epochs) on a single NVIDIA RTX A6000 (48 GB)
Sample rate: 24 kHz
Voice cloning: Works well with training speakers; generalizing to new voices is WIP
License: CC-BY-NC-4.0 (inherited from F5-TTS pretrained weights)

Evaluation — FLEURS Georgian Benchmark (979 unseen samples)

Round-trip CER: TTS generates audio → Meta Omnilingual ASR 7B transcribes → compare to original text.
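As a rough illustration of the metric, the sketch below computes CER and WER from a Levenshtein edit distance. It is not the evaluation script used for this model: the function names are hypothetical, and it assumes no text normalization (punctuation and whitespace are compared as-is), which may differ from the actual benchmark setup.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits divided by number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

In the round-trip setting, `reference` would be the text given to the TTS model and `hypothesis` the ASR transcript of the generated audio.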

Metric       Value
CER mean     0.0509
CER median   0.0309
CER p90      0.1183
CER std      0.0558
WER mean     0.1866
WER median   0.1600

CER distribution:

  • 65.9% of samples < 5% CER
  • 85.9% of samples < 10% CER
  • 96.5% of samples < 20% CER
  • 0 catastrophic failures (> 50% CER)

Evaluated with speaker 3 reference audio (NISQA MOS 4.99).
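The summary statistics and threshold shares above can be derived from a list of per-sample CER values along these lines. This is a sketch only: `summarize_cer` is a hypothetical helper, and the choice of population standard deviation and of the "inclusive" percentile method are assumptions, since the model card does not say which variants were used.

```python
import statistics


def summarize_cer(cers, thresholds=(0.05, 0.10, 0.20), catastrophic=0.50):
    """Aggregate per-sample CER values into summary statistics."""
    n = len(cers)
    summary = {
        "mean": statistics.fmean(cers),
        "median": statistics.median(cers),
        # Assumption: population std and inclusive percentile interpolation.
        "std": statistics.pstdev(cers),
        "p90": statistics.quantiles(cers, n=10, method="inclusive")[-1],
        # Count of samples above the catastrophic-failure threshold.
        "catastrophic": sum(c > catastrophic for c in cers),
    }
    for t in thresholds:
        # Share of samples strictly below each threshold.
        summary[f"share_below_{t:.2f}"] = sum(c < t for c in cers) / n
    return summary


if __name__ == "__main__":
    toy_cers = [0.0, 0.04, 0.08, 0.15, 0.60]  # made-up values for illustration
    for key, value in summarize_cer(toy_cers).items():
        print(f"{key}: {value}")
```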

Usage

Install

pip install f5-tts

Download Model

from huggingface_hub import hf_hub_download

# Download checkpoint and vocab
ckpt_path = hf_hub_download("NMikka/F5-TTS-Georgian", "model_110000.pt")
vocab_path = hf_hub_download("NMikka/F5-TTS-Georgian", "extended_vocab.txt")

Inference

The model works best with reference audio from the training dataset. Voice cloning to arbitrary Georgian speakers is a work in progress.

from datasets import load_dataset
from huggingface_hub import hf_hub_download
from f5_tts.api import F5TTS
import soundfile as sf
import numpy as np

# Download model
ckpt_path = hf_hub_download("NMikka/F5-TTS-Georgian", "model_110000.pt")
vocab_path = hf_hub_download("NMikka/F5-TTS-Georgian", "extended_vocab.txt")

# Load a reference sample from the training dataset
ds = load_dataset("NMikka/Common-Voice-Geo-Cleaned", split="test")
ref_sample = ds[92]  # Any sample can serve as the voice reference; index 92 is the one I used most during testing.

# Save reference audio to temp file (F5-TTS expects a file path)
ref_path = "/tmp/ref.wav"
sf.write(ref_path, np.array(ref_sample["audio"]["array"]), ref_sample["audio"]["sampling_rate"])

# Load model
model = F5TTS(
    ckpt_file=ckpt_path,
    vocab_file=vocab_path,
    device="cuda",
    use_ema=False,  # Important: this checkpoint was not trained with EMA
)

# Generate speech using a training speaker as reference
wav, sr, _ = model.infer(
    ref_file=ref_path,
    ref_text=ref_sample["text"],
    gen_text="საქართველო მდებარეობს კავკასიის რეგიონში, ევროპისა და აზიის გასაყარზე",
)
sf.write("output.wav", wav, sr)

Generation Parameters

wav, sr, _ = model.infer(
    ref_file="reference.wav",
    ref_text="reference transcript",
    gen_text="text to synthesize",
    nfe_step=32,       # Denoising steps (default 32; higher = better quality, slower)
    cfg_strength=2.0,  # Classifier-free guidance strength (default 2.0)
    speed=1.0,         # Speech speed multiplier
)

Training Details

Method: Full fine-tune (flow-matching loss, continuation of pretraining)
Base checkpoint: F5TTS_v1_Base/model_1250000.safetensors
Learning rate: 1e-5
Warmup: 500 steps
Batch size: 9,600 audio frames per GPU
Max sequences/batch: 64
Optimizer: 8-bit Adam (bitsandbytes)
Epochs: 100
Total updates: 110,000
Tokenizer: Character-level (char, not pinyin)
Vocab: 2,579 tokens (2,545 pretrained + 34 Georgian characters)
GPU: 1x NVIDIA RTX A6000 (48 GB)
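The figures above imply some back-of-the-envelope quantities, worked out in the sketch below. Two assumptions are not stated in this card: a mel hop length of 256 samples (the usual F5-TTS setting) and no gradient accumulation, so that one update corresponds to one batch. The results are estimates, not logged training statistics.

```python
# Values taken from the tables in this model card.
SAMPLES = 20_300
TOTAL_UPDATES = 110_000
EPOCHS = 100
FRAMES_PER_BATCH = 9_600
SAMPLE_RATE = 24_000

# Assumption: mel-spectrogram hop length of 256 samples per frame.
HOP_LENGTH = 256

# Assumption: no gradient accumulation, so one update = one batch.
updates_per_epoch = TOTAL_UPDATES / EPOCHS
samples_per_update = SAMPLES / updates_per_epoch

# Upper bound on audio per batch, since the frame budget is a maximum.
max_seconds_per_batch = FRAMES_PER_BATCH * HOP_LENGTH / SAMPLE_RATE

print(f"updates per epoch:       {updates_per_epoch:.0f}")
print(f"avg samples per update:  {samples_per_update:.2f}")
print(f"max audio per batch (s): {max_seconds_per_batch:.1f}")
```

Under these assumptions, an epoch is about 1,100 updates of roughly 18 clips each, and each batch holds at most about 102 seconds of audio.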

Vocab Extension

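The pretrained F5-TTS uses a pinyin-based vocabulary of 2,545 tokens. For Georgian, we extended it by appending 34 characters: the 33 Georgian letters (ა-ჰ) plus the low opening quotation mark („), for a total of 2,579 tokens. The text embedding layer was resized from 2,546 to 2,580 rows (one more than the vocabulary size in each case, consistent with F5-TTS reserving an extra index as a filler token), and the 34 new rows were initialized with the mean of the existing pretrained embeddings.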
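The idea can be sketched in a few lines. This is an illustration rather than the script used for this checkpoint: the function names are hypothetical, and plain Python lists stand in for the embedding weight matrix, which in practice would be a PyTorch tensor.

```python
from statistics import fmean


def georgian_extension_chars():
    """Return the 34 appended characters: ა..ჰ (U+10D0..U+10F0) plus „ (U+201E)."""
    letters = [chr(cp) for cp in range(0x10D0, 0x10F0 + 1)]
    return letters + ["\u201e"]


def extend_vocab(pretrained_tokens, new_tokens):
    """Append new tokens to the vocabulary, skipping any already present."""
    seen = set(pretrained_tokens)
    return list(pretrained_tokens) + [t for t in new_tokens if t not in seen]


def extend_embedding(rows, num_new):
    """Append num_new rows, each set to the column-wise mean of the existing rows."""
    mean_row = [fmean(column) for column in zip(*rows)]
    return [list(row) for row in rows] + [list(mean_row) for _ in range(num_new)]


if __name__ == "__main__":
    chars = georgian_extension_chars()
    print(len(chars), "new characters")

    # Toy 2-dimensional embedding table with two pretrained rows.
    toy_table = [[0.0, 2.0], [2.0, 4.0]]
    print(extend_embedding(toy_table, num_new=2))
```

Starting new rows at the mean keeps them inside the distribution of the pretrained embeddings, which is a common way to avoid destabilizing the first fine-tuning steps.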

Limitations and Future Work

  • License: CC-BY-NC-4.0 — non-commercial use only (inherited from F5-TTS weights)
  • Voice cloning to new speakers is limited — the model clones training speakers well but does not yet generalize to arbitrary Georgian voices. This is an active area of improvement.
  • Trained on 12 speakers from Common Voice Georgian — limited speaker diversity
  • Some complex Georgian text with rare characters may produce higher error rates
  • No emotion or prosody control beyond what the reference audio provides

Part of the Georgian TTS Benchmark

This model was trained as part of the first Georgian TTS benchmark — a comparative study of 6 open-source TTS architectures. See the full project: github.com/NMikaa/TTS_pipelines

Citation

@misc{f5tts-georgian-2026,
 title={F5-TTS Georgian: Fine-tuned Flow-Matching TTS for Georgian},
 author={NMikka},
 year={2026},
 url={https://huggingface.co/NMikka/F5-TTS-Georgian}
}