Chatterbox TTS — Slovak fine-tune
Slovak (slovenčina) fine-tune of Resemble AI's Chatterbox Multilingual TTS.
Drop-in T3 replacement weights — load the base ChatterboxMultilingualTTS,
then swap in these Slovak weights to get high-quality Slovak speech with
zero-shot voice cloning.
📝 Tuning guide: a seven-lesson writeup on fine-tuning Chatterbox for a
low-resource language is also published on dev.to:
Fine-tuning Chatterbox on a Low-Resource Language: 7 Things That Mattered
(GUIDE.md in this repo has the bilingual EN+SK version).
🇸🇰 Slovenčina dole (Slovak description below).
What's in this repo
| File | Size | What it is |
|---|---|---|
| `t3_sk_v2.2.safetensors` | ~2 GB | Slovak T3 weights (production default) |
| `GUIDE.md` | ~12 KB | Practical tuning guide: 7 lessons from fine-tuning Chatterbox on a low-resource language (EN + SK) |
This repo ships only model weights plus a few demo samples. You bring your own reference audio (3–10 s of clean Slovak speech) for voice cloning at inference time.
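Because cloning quality depends on the reference clip, it can help to check the clip's length before inference. The following is a minimal sketch using only Python's standard `wave` module. The helper names are illustrative and not part of Chatterbox, the default bounds simply mirror the 3–10 s recommendation above, and the code assumes an uncompressed PCM WAV file.

```python
import wave


def reference_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_usable_reference(path: str, min_s: float = 3.0, max_s: float = 10.0) -> bool:
    """Check that the clip length falls inside the recommended range.

    The bounds are assumptions taken from the 3-10 s guidance in this README;
    this says nothing about noise, music, or silence in the clip.
    """
    return min_s <= reference_duration_seconds(path) <= max_s
```

A clip that passes this check can still be a poor reference if it contains background noise or long silences, so treat the result as a first filter only.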
Demo samples
Generated with a Common Voice SK reference clip (CC-0). Reference audio not included — only model output.
Greeting — Dobrý deň, vitajte v ukážke slovenského syntetického hlasu.
Narrative — V Bratislave práve začína nový deň. Slnko vychádza nad Dunajom a mesto sa pomaly prebúdza.
Explanation — Tento model dokáže klonovať akýkoľvek hlas iba z niekoľkých sekúnd referenčnej nahrávky.
Long narrative (~30 s) — short text on Slovak language and its history (showcases prosody over a longer span).
Requirements
pip install chatterbox-tts torch torchaudio safetensors
A GPU is recommended (~3.5 GB VRAM). The model also runs on CPU, but slowly.
Quickstart
import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from chatterbox.mtl_tts import ChatterboxMultilingualTTS
device = "cuda" if torch.cuda.is_available() else "cpu"
# 1) Load the base multilingual Chatterbox
model = ChatterboxMultilingualTTS.from_pretrained(device=device)
# 2) Download Slovak T3 weights and patch them in
sk_weights = hf_hub_download(
    repo_id="pekiskol/chatterbox-tts-slovak",
    filename="t3_sk_v2.2.safetensors",
)
state = load_file(sk_weights, device="cpu")
# Handle vocab size mismatch between SK fine-tune and base model
target_vocab = model.t3.text_emb.weight.shape[0]
src_vocab = state["text_emb.weight"].shape[0]
if src_vocab > target_vocab:
    # Fine-tune has more text tokens than the base model: drop the extra rows
    state["text_emb.weight"] = state["text_emb.weight"][:target_vocab, :]
    state["text_head.weight"] = state["text_head.weight"][:target_vocab, :]
elif src_vocab < target_vocab:
    # Fine-tune has fewer text tokens: pad with the mean row of each matrix
    pad = target_vocab - src_vocab
    emb_pad = state["text_emb.weight"].mean(dim=0, keepdim=True).repeat(pad, 1)
    head_pad = state["text_head.weight"].mean(dim=0, keepdim=True).repeat(pad, 1)
    state["text_emb.weight"] = torch.cat([state["text_emb.weight"], emb_pad], dim=0)
    state["text_head.weight"] = torch.cat([state["text_head.weight"], head_pad], dim=0)
model.t3.load_state_dict(state, strict=True)
model.t3.to(device).eval()
# 3) Generate Slovak speech with zero-shot voice cloning
wav = model.generate(
    text="Ahoj, toto je ukážka slovenského hlasu generovaného modelom Chatterbox.",
    audio_prompt_path="path/to/your/reference.wav",  # 3–10 s of clean SK speech
    language_id="sk",
)
import torchaudio
torchaudio.save("output.wav", wav, model.sr)
Tips for good results
- Reference audio: 4–6 seconds of clean, dense speech works best. Avoid music, noise, and long silences.
- Text length: split very long inputs into sentences or short paragraphs; the model can lose coherence on overly long generations.
- Numbers and abbreviations: Slovak numbers, units (e.g. `20 %`, `Y100`) and acronyms (e.g. `NDA`) are sometimes mispronounced. For production use, normalise text first (write `dvadsať percent` instead of `20 %`, `eN-Dý-Á` instead of `NDA`).
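The text-side tips above (normalise first, then split long inputs) can be sketched as a small preprocessing step. This is an illustration under stated assumptions, not part of Chatterbox: the `REPLACEMENTS` table, the function names, and the 300-character chunk limit are all hypothetical, and the two table entries are just the examples from the tips. A real deployment would likely need a proper Slovak number-to-words converter.

```python
import re

# Hypothetical, hand-maintained table of written form -> spoken form.
# The two entries are the examples given in the tips above.
REPLACEMENTS = {
    "20 %": "dvadsať percent",
    "NDA": "eN-Dý-Á",
}


def normalise(text: str, table: dict[str, str] = REPLACEMENTS) -> str:
    """Replace each written form with its spoken form, whole tokens only."""
    # Longest keys first, so a longer entry is not pre-empted by a shorter one.
    for written in sorted(table, key=len, reverse=True):
        spoken = table[written]
        # The lookarounds keep "NDA" from matching inside a word like "AGENDA".
        pattern = rf"(?<!\w){re.escape(written)}(?!\w)"
        text = re.sub(pattern, lambda _match: spoken, text)
    return text


def split_sentences(text: str, max_chars: int = 300) -> list[str]:
    """Split on sentence-ending punctuation, then pack sentences into chunks.

    max_chars is an assumed limit, not a documented model constraint.
    A single sentence longer than max_chars is kept whole.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be synthesised separately and the resulting waveforms concatenated. Note that the naive punctuation split will also break after abbreviations that end in a period, which is one more reason to normalise abbreviations beforehand.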
Limitations
- Slovak only — for other languages use the original Chatterbox Multilingual.
- Quality depends heavily on the reference audio.
- Code-switching (mixing Slovak with English in one sentence) can lead to mispronounced foreign words.
- The model can occasionally produce quiet, garbled audio mid-utterance on hard inputs; usually fixed by re-generating or splitting the text.
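The re-generation workaround from the last point can be automated with a crude loudness check. The sketch below is a heuristic, not the author's method: it measures what fraction of fixed-size windows fall below an RMS threshold and retries when too much of the clip is quiet. All names, the 0.25 s window, the 0.01 threshold, and the 0.3 limit are assumptions to tune by ear. The `generate` argument is any zero-argument callable returning samples as floats, so the helper does not depend on a particular TTS API.

```python
import math
from typing import Callable, Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of a block of samples (0.0 for an empty block)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def quiet_fraction(
    samples: Sequence[float],
    sample_rate: int,
    window_s: float = 0.25,   # assumed window length
    threshold: float = 0.01,  # assumed RMS level below which a window is "quiet"
) -> float:
    """Fraction of non-overlapping windows whose RMS is below the threshold."""
    window = max(1, int(sample_rate * window_s))
    # Clips shorter than one window are scored as a single window.
    starts = range(0, max(len(samples) - window, 0) + 1, window)
    quiet = sum(1 for start in starts if rms(samples[start:start + window]) < threshold)
    return quiet / len(starts)


def generate_with_retry(
    generate: Callable[[], Sequence[float]],
    sample_rate: int,
    max_attempts: int = 3,
    max_quiet: float = 0.3,  # assumed acceptable share of quiet windows
) -> Sequence[float]:
    """Call generate() up to max_attempts times and return the least quiet result."""
    best, best_score = None, float("inf")
    for _ in range(max_attempts):
        samples = generate()
        score = quiet_fraction(samples, sample_rate)
        if score < best_score:
            best, best_score = samples, score
        if score <= max_quiet:
            break
    return best
```

Natural pauses also count as quiet windows, so the limit should stay fairly generous. A check like this catches only the "quiet" half of the failure mode; garbled but loud audio would still need a listening pass or a different metric.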
License
This fine-tune is released under the MIT License, matching the base Chatterbox license. You are free to use it commercially.
When using this model, please credit:
- This fine-tune (link to this repo)
- Resemble AI Chatterbox (base model)
Citation
If this model is useful in your work, a citation/credit is appreciated:
@misc{chatterbox-tts-slovak,
author = {pekiskol},
title = {Chatterbox TTS — Slovak fine-tune},
year = {2026},
url = {https://huggingface.co/pekiskol/chatterbox-tts-slovak}
}
🇸🇰 Slovak summary (Po slovensky)
This is a fine-tune of Resemble AI's Chatterbox Multilingual TTS model, further trained on Slovak. Usage:
- Load the base `ChatterboxMultilingualTTS` from Resemble AI.
- Replace the T3 weights with those from `t3_sk_v2.2.safetensors`.
- Generate Slovak speech with zero-shot voice cloning: the model copies the voice timbre from a 3–10 second reference recording that you supply.
License: MIT. Commercial use is permitted; when publishing, include a link to this model and to the base Chatterbox.
Reference audio: the repo contains no reference voice samples. You supply your own voice (or a voice used with explicit consent) at inference time.