SoroTTS: a natural voice for Yorùbá, Hausa, Igbo, and Nigerian Pidgin
SoroTTS (from sọ̀rọ̀, "speak" in Yorùbá) is a LoRA fine-tune of Orpheus-3B that gives natural, expressive text-to-speech in four Nigerian languages: Yorùbá, Hausa, Igbo, and Nigerian Pidgin. A single adapter covers all four, and the model stays under the 4B-parameter "Tiny Titan" line.
It powers Naija Solar, a voice-first solar-sizing app that reads your plan aloud in your own language, and it was built for the Build Small Hackathon (Gradio × Hugging Face).
🔊 Listen · 🟢 Try it in Naija Solar · 🎥 Demo video · 💻 Training code on GitHub
🔊 Listen
Hear each language (cleanest voice per language):
| Language | Voice | Sample |
|---|---|---|
| Yorùbá | Yor1 | yor_Yor1.wav |
| Hausa | Hau1 | hau_Hau1.wav |
| Igbo | Ibo1 | ibo_Ibo1.wav |
| Nigerian Pidgin | NaijaA | pcm_NaijaA.wav |
Why it exists
Off-the-shelf TTS for Nigerian languages is either robotic (Meta MMS) or simply absent. Orpheus is one of the most natural open speech LLMs, but it is English-first and speaks no Nigerian language. SoroTTS closes that gap. It keeps Orpheus's natural prosody while speaking Yorùbá, Hausa, Igbo, and the language no public model spoke before, Nigerian Pidgin, the everyday tongue of perhaps a hundred million people.
What it is
- Architecture: Orpheus-3B, a Llama-3B backbone that predicts SNAC 24 kHz neural audio codes. About 3B parameters, under the 4B line.
- Base: hypaai/hypaai_orpheus_v5 (Hypa-Orpheus), already adapted to Yorùbá, Hausa, and Igbo. SoroTTS adds Nigerian Pidgin and reinforces the other three.
- This repo: a single LoRA adapter (r=64) covering all four languages, plus the tokenizer. Attach it to the base at inference time (see Quickstart).
Voices
Each clean data source becomes a named voice tag that you prepend to the text. An Orpheus speech LLM conditions on the voice tag, so more clean sources means more coverage. The cleanest source per language is voice 1:
| Language | Cleanest voice | Other voices |
|---|---|---|
| Yorùbá | Yor1 | Yor2, Yor3 |
| Hausa | Hau1 | Hau2, ... , HauBible (BibleTTS studio) |
| Igbo | Ibo1 | Ibo2, Ibo3 |
| Nigerian Pidgin | NaijaA | NaijaB, ... (one per speaker) |
English falls back to a base Orpheus / Hypa voice (e.g. Eniola).
Tip: write Yorùbá and Igbo with diacritics (tone marks and dots) and Hausa with hooked letters (ɓ ɗ ƙ). The model learned diacritised text, so "Ẹ kú àbọ̀" sounds far better than "E ku abo".
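The voice table and the diacritics tip can be folded into a small prompt helper. This is only a sketch: `DEFAULT_VOICE`, `build_prompt`, and `has_diacritics` are hypothetical names that are not part of the SoroTTS repo. The language codes are the ones used in the training-data table, and the `voice: text` layout follows the Quickstart.

```python
import unicodedata
from typing import Optional

# Cleanest voice per language, copied from the table above.
DEFAULT_VOICE = {"yor": "Yor1", "hau": "Hau1", "ibo": "Ibo1", "pcm": "NaijaA"}

# Hausa hooked letters do not decompose under NFD, so they are checked directly.
HAUSA_HOOKED = set("ɓɗƙƁƊƘ")


def has_diacritics(text: str) -> bool:
    """True if text carries combining marks (tone marks, dots below) or Hausa hooked letters."""
    decomposed = unicodedata.normalize("NFD", text)
    return any(unicodedata.combining(ch) or ch in HAUSA_HOOKED for ch in decomposed)


def build_prompt(lang: str, text: str, voice: Optional[str] = None) -> str:
    """Prepend the voice tag in the 'voice: text' form that the Quickstart uses."""
    return f"{voice or DEFAULT_VOICE[lang]}: {text}"


if __name__ == "__main__":
    text = "Ẹ kú àbọ̀"
    if not has_diacritics(text):
        print("warning: text has no tone marks; expect weaker pronunciation")
    print(build_prompt("yor", text))
```

A check like `has_diacritics` only flags text that has no marks at all. It cannot tell whether the marks that are present are the right ones.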
🚀 Quickstart
pip install unsloth snac soundfile peft huggingface_hub
import os, torch, soundfile as sf
from huggingface_hub import snapshot_download
from unsloth import FastLanguageModel
from peft import PeftModel
from snac import SNAC
# 1) load the base WITHOUT its bundled adapter, then attach SoroTTS
local = snapshot_download("hypaai/hypaai_orpheus_v5", ignore_patterns=["adapter_*"])
for f in ("adapter_config.json", "adapter_model.safetensors", "adapter_model.bin"):
    p = os.path.join(local, f)
    if os.path.exists(p):
        os.remove(p)
model, tok = FastLanguageModel.from_pretrained(local, max_seq_length=2048, dtype=None, load_in_4bit=False)
model = PeftModel.from_pretrained(model, "Shinzmann/sorotts") # <- the SoroTTS adapter
FastLanguageModel.for_inference(model)
snac = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").to("cuda").eval()
# 2) Orpheus token scheme (must match training)
SOH, EOT, EOH, SOS, EOS, EOAI, OFF = 128259, 128009, 128260, 128257, 128258, 128262, 128266
DEV = next(snac.parameters()).device
def _decode(codes):
    cl = lambda v: 0 if v < 0 else (4095 if v > 4095 else v)  # clamp so a stray code can't crash SNAC
    l1, l2, l3 = [], [], []
    for i in range(len(codes) // 7):  # whole 7-token frames only
        b = 7 * i
        # de-interleave one frame into SNAC's three layers (1 + 2 + 4 codes)
        l1.append(cl(codes[b])); l2.append(cl(codes[b+1] - 4096))
        l3.append(cl(codes[b+2] - 2*4096)); l3.append(cl(codes[b+3] - 3*4096))
        l2.append(cl(codes[b+4] - 4*4096))
        l3.append(cl(codes[b+5] - 5*4096)); l3.append(cl(codes[b+6] - 6*4096))
    c = [torch.tensor(x).unsqueeze(0).to(DEV) for x in (l1, l2, l3)]
    with torch.inference_mode():
        return snac.decode(c).squeeze().cpu().numpy()
def speak(text, voice="Yor1", max_new_tokens=1024):
    ids = tok(f"{voice}: {text}", return_tensors="pt").input_ids
    ids = torch.cat([torch.tensor([[SOH]]), ids, torch.tensor([[EOT, EOH]])], dim=1).to(model.device)
    out = model.generate(input_ids=ids, attention_mask=torch.ones_like(ids),
                         max_new_tokens=max_new_tokens, do_sample=True, temperature=0.55,
                         top_p=0.95, repetition_penalty=1.1,
                         eos_token_id=[EOS, EOAI], use_cache=True)[0]  # stop on EOS or EOAI
    sos = (out == SOS).nonzero(as_tuple=True)[0]
    seq = out[sos[-1].item() + 1:] if len(sos) else out
    seq = seq[seq >= OFF]  # keep only SNAC audio codes
    n = (seq.size(0) // 7) * 7
    return _decode([t.item() - OFF for t in seq[:n]])
wave = speak("Ẹ kú àbọ̀ sí Nàìjíríà, orílẹ̀-èdè wa tó kún fún ìbùkún.", voice="Yor1")
sf.write("out.wav", wave, 24000)
Two gotchas worth knowing (both handled above):
- Stop on EOS or EOAI. The model often ends a clip with EOAI (128262); if you stop only on EOS it will run to the token cap and babble.
- Keep only audio codes, and clamp them. Filter to tokens >= 128266, then clamp each SNAC value to [0, 4095]. A stray control token left in the stream corrupts the 7-token frame alignment and crashes the SNAC decoder.
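The second gotcha is easier to see on plain lists, without a model or SNAC loaded. The sketch below mirrors the filtering, truncation, and clamping that the Quickstart's `speak` and `_decode` do. `frames_to_layers` is a hypothetical helper written for illustration, not a function from the repo.

```python
OFF = 128266      # first SNAC audio token id (from the Quickstart)
CODEBOOK = 4096   # codes available at each frame position


def frames_to_layers(token_ids):
    """Drop non-audio tokens, keep whole 7-token frames, split them into SNAC's three layers."""
    codes = [t - OFF for t in token_ids if t >= OFF]  # control tokens sit below OFF
    codes = codes[: len(codes) // 7 * 7]              # discard a trailing partial frame
    # Frame positions 0..6 feed layers 1,2,3,3,2,3,3; position p carries an offset of p * 4096.
    layer_of = (0, 1, 2, 2, 1, 2, 2)
    layers = ([], [], [])
    for i, code in enumerate(codes):
        pos = i % 7
        value = code - pos * CODEBOOK
        layers[layer_of[pos]].append(max(0, min(CODEBOOK - 1, value)))  # clamp to [0, 4095]
    return layers


if __name__ == "__main__":
    frame = [OFF + p * CODEBOOK + c for p, c in enumerate([5, 6, 7, 8, 9, 10, 11])]
    # One stray control token (EOS, 128258) in the middle and a 2-token partial frame at the end.
    stream = frame[:3] + [128258] + frame[3:] + [OFF + 1, OFF + CODEBOOK + 2]
    print(frames_to_layers(stream))
```

Each frame contributes one code to the first layer, two to the second, and four to the third, which is why the stream has to stay aligned to multiples of 7.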
For longer text, generate one sentence per call (each call has a budget of roughly 2048 tokens, about 15 seconds of audio) and stitch the sentences together with a short silence between them.
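One way to do the sentence-by-sentence stitching is sketched below. `split_sentences` and `speak_long` are hypothetical helpers, the 0.3 second pause is an arbitrary choice, and the punctuation-based splitter is deliberately naive. Plain Python lists keep the sketch dependency-free; with NumPy you would more likely concatenate arrays.

```python
import re

SAMPLE_RATE = 24000  # SNAC output rate used in the Quickstart


def split_sentences(text):
    """Split after sentence-ending punctuation followed by whitespace; drop empty pieces."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def speak_long(text, synth, pause_s=0.3):
    """Synthesise sentence by sentence and join the clips with pause_s seconds of silence.

    synth is any callable that maps one sentence to a sequence of samples,
    for example lambda s: speak(s, voice="Yor1") with the Quickstart's speak().
    """
    silence = [0.0] * int(pause_s * SAMPLE_RATE)
    samples = []
    for i, sentence in enumerate(split_sentences(text)):
        if i:  # no leading silence before the first sentence
            samples.extend(silence)
        samples.extend(float(x) for x in synth(sentence))
    return samples


if __name__ == "__main__":
    fake_synth = lambda sentence: [0.1] * 10  # stand-in so the sketch runs without a GPU
    print(len(speak_long("First sentence. Second one? Third!", fake_synth)))
```

With the real model, pass `lambda s: speak(s, voice="Yor1")` as `synth` and convert the result with `np.asarray(samples, dtype="float32")` before writing it with `sf.write(..., 24000)`.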
Training data
31,574 SNAC-tokenised clips (0.7 to 15 seconds each), streamed and ranked cleanest-first, across:
| Corpus | Languages | License | Role |
|---|---|---|---|
| NaijaVoices | yor / hau / ibo | CC-BY-NC-SA | the bulk (about 600h per language available) |
| WAXAL TTS | yor / hau / ibo | CC-BY | clean single-speaker |
| FLEURS | yor / hau / ibo | CC-BY | read speech |
| BibleTTS | hau | CC-BY-SA | studio 48 kHz (HauBible) |
| Nigerian Pidgin v1.0 | pcm | CC-BY | the Pidgin anchor (the new language) |
Because NaijaVoices (non-commercial) is included, the model is stamped cc-by-nc-sa-4.0. A permissive, commercial variant is possible by training on the CC-BY and CC-BY-SA sources only (the training script has a --commercial-only flag).
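The licensing logic comes down to dropping any corpus with a NonCommercial clause. The sketch below illustrates that idea with the rows of the table above. It is not the training script's actual `--commercial-only` implementation, which is not shown here, and `commercial_only` is a hypothetical helper.

```python
# Corpus names and licenses copied from the training-data table.
CORPORA = [
    {"name": "NaijaVoices", "license": "CC-BY-NC-SA"},
    {"name": "WAXAL TTS", "license": "CC-BY"},
    {"name": "FLEURS", "license": "CC-BY"},
    {"name": "BibleTTS", "license": "CC-BY-SA"},
    {"name": "Nigerian Pidgin v1.0", "license": "CC-BY"},
]


def commercial_only(corpora):
    """Keep corpora whose license has no NonCommercial (NC) element."""
    return [c for c in corpora if "NC" not in c["license"].split("-")]


if __name__ == "__main__":
    for corpus in commercial_only(CORPORA):
        print(corpus["name"], corpus["license"])
```

Note that a CC-BY-SA source such as BibleTTS still carries a share-alike obligation, so "commercial" here does not mean "no conditions".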
How it was trained
- LoRA r=64, α=64, dropout 0, on all attention and MLP projections (the Orpheus-TTS standard), via Unsloth. Only 97.3M parameters are trained, 2.86% of the 3.40B-parameter model.
- 2 epochs = 7,894 steps, bf16, AdamW-8bit, lr 2e-4, 3% warmup, total batch 8 (8 per device, 1 gradient-accumulation step), on a single NVIDIA B200. The run takes about 31 minutes (1,894 s, roughly 33 samples per second) and converges to a train loss near 3.5.
- Audio is encoded to Orpheus's 7-tokens-per-frame SNAC stream; each training sequence is [SOH] voice: text [EOT][EOH] [SOAI][SOS] <snac codes> [EOS][EOAI].
- Fully reproducible on Modal: streaming and SNAC-encoding the data, the LoRA training, and the Hub push all run as one serverless job. See modal/finetune_orpheus.py (train), serving_tts.py (serve), and test_sorotts.py (samples).
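The training-sequence layout above can be written out as a small builder. This is a sketch of the layout only, not the code in modal/finetune_orpheus.py. The token ids are the ones listed in the Quickstart, except `SOAI = 128261`, which is an assumption taken from the upstream Orpheus convention. `audio_tokens` and `training_sequence` are hypothetical names.

```python
SOH, EOT, EOH = 128259, 128009, 128260
SOS, EOS, EOAI, OFF = 128257, 128258, 128262, 128266
SOAI = 128261  # assumption: upstream Orpheus start-of-AI id; not listed in the Quickstart


def audio_tokens(frames):
    """Flatten 7-code SNAC frames into token ids; position p is shifted by OFF + p * 4096.

    Each frame is expected in the interleaved order that the Quickstart decoder
    undoes: layer 1, layer 2, layer 3, layer 3, layer 2, layer 3, layer 3.
    """
    ids = []
    for frame in frames:
        if len(frame) != 7:
            raise ValueError("each frame needs exactly 7 codes")
        ids.extend(OFF + p * 4096 + code for p, code in enumerate(frame))
    return ids


def training_sequence(prompt_ids, frames):
    """[SOH] voice: text [EOT][EOH] [SOAI][SOS] <snac codes> [EOS][EOAI]"""
    return [SOH, *prompt_ids, EOT, EOH, SOAI, SOS, *audio_tokens(frames), EOS, EOAI]


if __name__ == "__main__":
    # prompt_ids would come from the tokenizer applied to "Yor1: <text>".
    print(training_sequence([1, 2], [[0, 1, 2, 3, 4, 5, 6]]))
```

At inference time the Quickstart supplies everything up to `[EOH]` and lets the model generate the rest, which is why generation has to stop on `EOS` or `EOAI`.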
Intended use and limitations
- Intended use: accessibility and voice interfaces in Nigerian languages, reading text aloud, IVR, narration. Built for Naija Solar.
- Speed: Orpheus is autoregressive (roughly 10 to 25 seconds of compute per sentence on an A100 or H200). Great behind a "written-first" UI; for real time, serve via vLLM.
- English is the base model's, not specifically fine-tuned here.
- Pidgin has the thinnest data (one clean 4 to 5 hour corpus). It works because Pidgin is English-lexified and Orpheus has a strong English prior, but a dedicated single-speaker corpus would sharpen it.
- Numbers and units: like most TTS, it reads digits and abbreviations literally. For natural speech, spell numbers as words (Naija Solar does this per language).
- Ethics: this is a voice for accessibility. Do not use it to clone a real person's voice without consent.
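For the numbers-and-units limitation, the usual workaround is to expand digits into words before calling the model. The sketch below does this for English number words only. The per-language tables that Naija Solar uses are not reproduced here, and `number_to_words` and `spell_numbers` are hypothetical helpers. Decimals, units, and currency would need their own rules.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
        "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
        "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """English words for integers from 0 to 999,999."""
    if not 0 <= n < 1_000_000:
        raise ValueError("supported range is 0..999999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, rest = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[rest] if rest else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" and " + number_to_words(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return number_to_words(thousands) + " thousand" + (" " + number_to_words(rest) if rest else "")


def spell_numbers(text: str) -> str:
    """Replace each run of digits with words; runs outside the supported range stay as digits."""
    def repl(match):
        n = int(match.group())
        return number_to_words(n) if n < 1_000_000 else match.group()
    return re.sub(r"\d+", repl, text)


if __name__ == "__main__":
    print(spell_numbers("Install 12 panels and 2 batteries"))
```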
License
CC-BY-NC-SA-4.0, inherited from the NaijaVoices data. The base model and the other corpora carry their own licenses; see each link above. Please attribute SoroTTS and share derivatives alike.
Acknowledgements
Built on Hypa-Orpheus, Orpheus-TTS, SNAC, and Unsloth, with data from NaijaVoices, Google WAXAL and FLEURS, BibleTTS, and the Nigerian Pidgin ASR Corpus.
Citation
@misc{ashinze2026sorotts,
title = {SoroTTS: a natural voice for Yoruba, Hausa, Igbo, and Nigerian Pidgin},
author = {Ashinze, Emmanuel},
year = {2026},
howpublished = {\url{https://huggingface.co/Shinzmann/sorotts}},
note = {LoRA fine-tune of Orpheus-3B, built for the Build Small Hackathon}
}
By Ashinze Emmanuel, for the Build Small Hackathon.