Chatterbox TTS — Slovak fine-tune

Slovak (slovenčina) fine-tune of Resemble AI's Chatterbox Multilingual TTS. Drop-in T3 replacement weights — load the base ChatterboxMultilingualTTS, then swap in these Slovak weights to get high-quality Slovak speech with zero-shot voice cloning.

📝 Tuning guide: a seven-lesson writeup on fine-tuning Chatterbox for a low-resource language is also published on dev.to as "Fine-tuning Chatterbox on a Low-Resource Language: 7 Things That Mattered" (or see GUIDE.md in this repo for the bilingual EN+SK version).

🇸🇰 Slovenčina dole (Slovak description below).

What's in this repo

| File | Size | What it is |
| --- | --- | --- |
| `t3_sk_v2.2.safetensors` | ~2 GB | Slovak T3 weights (production default) |
| `GUIDE.md` | ~12 KB | Practical tuning guide: 7 lessons from fine-tuning Chatterbox on a low-resource language (EN + SK) |

This repo ships only model weights plus a few demo samples. You bring your own reference audio (3–10 s of clean Slovak speech) for voice cloning at inference time.
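
A quick sanity check on the reference clip can catch recordings outside the 3–10 s range before inference. The sketch below uses only Python's standard `wave` module, so it assumes an uncompressed PCM WAV file. The helper names `wav_duration_seconds` and `reference_duration_ok` are illustrative and not part of Chatterbox.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def reference_duration_ok(path: str, min_s: float = 3.0, max_s: float = 10.0) -> bool:
    """True if the clip length falls inside the recommended range.

    The default bounds mirror the 3-10 s guidance given in this README.
    """
    return min_s <= wav_duration_seconds(path) <= max_s
```

This only checks length. It says nothing about noise, music, or silence in the clip, which still need a listen.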

Demo samples

The samples were generated with a Common Voice SK reference clip (CC0). The reference audio itself is not included, only the model output.

Greeting — Dobrý deň, vitajte v ukážke slovenského syntetického hlasu.

Narrative — V Bratislave práve začína nový deň. Slnko vychádza nad Dunajom a mesto sa pomaly prebúdza.

Explanation — Tento model dokáže klonovať akýkoľvek hlas iba z niekoľkých sekúnd referenčnej nahrávky.

Long narrative (~30 s) — short text on Slovak language and its history (showcases prosody over a longer span).

Requirements

pip install chatterbox-tts torch torchaudio safetensors huggingface_hub

GPU recommended (~3.5 GB VRAM). Runs on CPU but slowly.

Quickstart

import torch
import torchaudio
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from chatterbox.mtl_tts import ChatterboxMultilingualTTS

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) Load the base multilingual Chatterbox
model = ChatterboxMultilingualTTS.from_pretrained(device=device)

# 2) Download Slovak T3 weights and patch them in
sk_weights = hf_hub_download(
    repo_id="pekiskol/chatterbox-tts-slovak",
    filename="t3_sk_v2.2.safetensors",
)
state = load_file(sk_weights, device="cpu")

# Handle a text-vocab size mismatch between the SK fine-tune and the base model:
# truncate surplus rows, or pad missing rows with the mean of the existing rows.
target_vocab = model.t3.text_emb.weight.shape[0]
src_vocab = state["text_emb.weight"].shape[0]
if src_vocab > target_vocab:
    state["text_emb.weight"] = state["text_emb.weight"][:target_vocab, :]
    state["text_head.weight"] = state["text_head.weight"][:target_vocab, :]
elif src_vocab < target_vocab:
    pad = target_vocab - src_vocab
    emb_pad = state["text_emb.weight"].mean(dim=0, keepdim=True).repeat(pad, 1)
    head_pad = state["text_head.weight"].mean(dim=0, keepdim=True).repeat(pad, 1)
    state["text_emb.weight"] = torch.cat([state["text_emb.weight"], emb_pad], dim=0)
    state["text_head.weight"] = torch.cat([state["text_head.weight"], head_pad], dim=0)

model.t3.load_state_dict(state, strict=True)
model.t3.to(device).eval()

# 3) Generate Slovak speech with zero-shot voice cloning
wav = model.generate(
    text="Ahoj, toto je ukážka slovenského hlasu generovaného modelom Chatterbox.",
    audio_prompt_path="path/to/your/reference.wav",  # 3–10 s of clean SK speech
    language_id="sk",
)

# Save the result at the model's sample rate
torchaudio.save("output.wav", wav, model.sr)

Tips for good results

Reference audio: 4–6 seconds of clean, dense speech works best. Avoid music, noise, and long silences.
Text length: split very long inputs into sentences or short paragraphs; the model can lose coherence on overly long generations.
Numbers and abbreviations: Slovak numbers, units (e.g. 20 %, Y100) and acronyms (e.g. NDA) are sometimes mispronounced. For production use, normalise text first (write dvadsať percent instead of 20 %, eN-Dý-Á instead of NDA).
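
The text-length and normalisation tips above can be combined into a small preprocessing step. This is a minimal sketch, not a full Slovak text normaliser. The `REPLACEMENTS` table, the helper names, and the `max_chars` default of 200 are all assumptions chosen for illustration. The sentence splitter is naive and will also split after abbreviations that end in a period.

```python
import re

# Hypothetical lookup table: extend it with the tokens your own texts contain.
REPLACEMENTS = {
    "20 %": "dvadsať percent",
    "NDA": "eN-Dý-Á",
}


def normalise(text: str) -> str:
    """Replace known hard-to-pronounce tokens with spelled-out forms.

    Matches only whole tokens, so "NDA" inside a longer word is left alone.
    """
    for raw, spoken in REPLACEMENTS.items():
        pattern = rf"(?<!\w){re.escape(raw)}(?!\w)"
        text = re.sub(pattern, lambda _m, s=spoken: s, text)
    return text


def split_sentences(text: str, max_chars: int = 200) -> list[str]:
    """Split after ., ! or ? and pack sentences into chunks of at most
    max_chars characters (a single longer sentence is kept whole)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: list[str] = []
    for sentence in sentences:
        if chunks and len(chunks[-1]) + 1 + len(sentence) <= max_chars:
            chunks[-1] += " " + sentence
        else:
            chunks.append(sentence)
    return chunks
```

Each chunk can then be passed as `text` to `model.generate(...)` from the Quickstart. Assuming each call returns a tensor with time on the last dimension, the pieces could be joined with `torch.cat(pieces, dim=-1)` before saving.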

Limitations

Slovak only — for other languages use the original Chatterbox Multilingual.
Quality depends heavily on the reference audio.
Code-switching (mixing Slovak with English in one sentence) can produce wrong pronunciation on the foreign words.
The model can occasionally produce quiet, garbled audio mid-utterance on hard inputs; usually fixed by re-generating or splitting the text.
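
Because re-generating usually fixes a bad take, the retry can be automated. The sketch below is one possible shape for that, with hypothetical helpers `rms` and `generate_with_retry`. The acceptance check is supplied by the caller. A whole-clip RMS level is only a crude proxy: it flags output that is quiet overall and will not detect garbling or a short quiet stretch mid-utterance.

```python
import math
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of a sequence of audio samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def generate_with_retry(
    generate: Callable[[], T],
    is_acceptable: Callable[[T], bool],
    max_tries: int = 3,
) -> T:
    """Call generate() until is_acceptable(result) is true or tries run out.

    Returns the last result either way, so the caller always gets some audio.
    """
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    result = generate()
    for _ in range(max_tries - 1):
        if is_acceptable(result):
            break
        result = generate()
    return result
```

With the Quickstart model this might be called as `generate_with_retry(lambda: model.generate(...), lambda wav: rms(wav.flatten().tolist()) >= 0.02)`, assuming `generate` returns a float tensor. The 0.02 threshold is an arbitrary placeholder to tune by ear.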

License

This fine-tune is released under the MIT License, matching the base Chatterbox license. You are free to use it commercially.

When using this model, please credit:

This fine-tune (link to this repo)
Resemble AI Chatterbox (base model)

Citation

If this model is useful in your work, a citation/credit is appreciated:

@misc{chatterbox-tts-slovak,
 author = {pekiskol},
 title = {Chatterbox TTS — Slovak fine-tune},
 year = {2026},
 url = {https://huggingface.co/pekiskol/chatterbox-tts-slovak}
}

🇸🇰 Po slovensky (Slovak summary, translated)

This is a fine-tune of Resemble AI's Chatterbox Multilingual TTS, further trained on Slovak. Usage:

Load the base ChatterboxMultilingualTTS from Resemble AI.
Replace the T3 weights with those from the file t3_sk_v2.2.safetensors.
Generate Slovak speech with zero-shot voice cloning: the model copies the voice timbre from a 3–10 second reference recording that you supply.

License: MIT. Commercial use is permitted; when publishing, just include a link to this model and to the base Chatterbox.

Reference audio: the repo contains no reference voice samples. You supply your own voice (or a voice used with explicit consent) at inference time.
