SoroTTS: a natural voice for Yorùbá, Hausa, Igbo, and Nigerian Pidgin

SoroTTS (from sọ̀rọ̀, "speak" in Yorùbá) is a LoRA fine-tune of Orpheus-3B that produces natural, expressive text-to-speech in four Nigerian languages: Yorùbá, Hausa, Igbo, and Nigerian Pidgin. A single adapter serves all four, and the model stays under the 4B-parameter "Tiny Titan" line.

It powers Naija Solar, a voice-first solar-sizing app that reads your plan aloud in your own language, and it was built for the Build Small Hackathon (Gradio × Hugging Face).

🔊 Listen

Hear each language (cleanest voice per language):

| Language | Voice | Sample |
|---|---|---|
| Yorùbá | Yor1 | yor_Yor1.wav |
| Hausa | Hau1 | hau_Hau1.wav |
| Igbo | Ibo1 | ibo_Ibo1.wav |
| Nigerian Pidgin | NaijaA | pcm_NaijaA.wav |

Why it exists

Off-the-shelf TTS for Nigerian languages is either robotic (Meta MMS) or simply absent. Orpheus is one of the most natural open speech LLMs, but it is English-first and speaks no Nigerian language. SoroTTS closes that gap. It keeps Orpheus's natural prosody while speaking Yorùbá, Hausa, Igbo, and the language no public model spoke before, Nigerian Pidgin, the everyday tongue of perhaps a hundred million people.

What it is

  • Architecture: Orpheus-3B, a Llama 3B-class backbone that predicts SNAC neural audio codes for 24 kHz audio. About 3.4B parameters in total, under the 4B line.
  • Base: hypaai/hypaai_orpheus_v5 (Hypa-Orpheus), already adapted to Yorùbá, Hausa, and Igbo. SoroTTS adds Nigerian Pidgin and reinforces the other three.
  • This repo: a single LoRA adapter (r=64) covering all four languages, plus the tokenizer. Attach it to the base at inference time (see Quickstart).

Voices

Each clean data source becomes a named voice tag that you prepend to the text. Orpheus conditions its speech on the voice tag, so more clean sources mean more coverage. The cleanest source per language carries the first tag (Yor1, Hau1, Ibo1, NaijaA):

| Language | Cleanest voice | Other voices |
|---|---|---|
| Yorùbá | Yor1 | Yor2, Yor3 |
| Hausa | Hau1 | Hau2, ... , HauBible (BibleTTS studio) |
| Igbo | Ibo1 | Ibo2, Ibo3 |
| Nigerian Pidgin | NaijaA | NaijaB, ... (one per speaker) |

English falls back to a base Orpheus / Hypa voice (e.g. Eniola).
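As a minimal sketch of how a voice tag is prepended, the helper below maps a language code to its cleanest voice and builds the "voice: text" string that the Quickstart passes to the tokenizer. The names `DEFAULT_VOICE` and `build_prompt` are hypothetical, and the language codes (yor, hau, ibo, pcm) are assumed from the sample filenames above.

```python
from typing import Optional

# Cleanest voice tag per language, taken from the Voices table.
# ASSUMPTION: the language codes mirror the sample filename prefixes.
DEFAULT_VOICE = {"yor": "Yor1", "hau": "Hau1", "ibo": "Ibo1", "pcm": "NaijaA"}

def build_prompt(text: str, lang: str, voice: Optional[str] = None) -> str:
    """Return the 'voice: text' prompt string; falls back to the cleanest voice."""
    tag = voice or DEFAULT_VOICE[lang]
    return f"{tag}: {text.strip()}"

print(build_prompt("Sannu", "hau"))                    # Hau1: Sannu
print(build_prompt("How far", "pcm", voice="NaijaB"))  # NaijaB: How far
```

An unknown language code raises a KeyError here, which is arguably preferable to silently speaking with the wrong voice.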

Tip: write Yorùbá and Igbo with diacritics (tone marks and dots) and Hausa with hooked letters (ɓ ɗ ƙ). The model learned diacritised text, so Ẹ kú àbọ̀ sounds far better than E ku abo.
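One way to catch undiacritised input before synthesis is a rough heuristic like the one below. It only detects whether any combining mark (tone mark or underdot) is present after Unicode decomposition; it does not validate spelling, and the function names are hypothetical. It is not applied to Hausa, since many correct Hausa sentences contain no hooked letters.

```python
import unicodedata

def has_combining_marks(text: str) -> bool:
    """True if any character carries a combining mark once decomposed (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return any(unicodedata.combining(ch) for ch in decomposed)

def looks_undiacritised(text: str, lang: str) -> bool:
    """Heuristic warning for Yorùbá and Igbo only; other languages return False."""
    return lang in ("yor", "ibo") and not has_combining_marks(text)

print(has_combining_marks("Ẹ kú àbọ̀"))         # True
print(looks_undiacritised("E ku abo", "yor"))   # True
```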

🚀 Quickstart

pip install unsloth snac soundfile peft huggingface_hub

import os, torch, soundfile as sf
from huggingface_hub import snapshot_download
from unsloth import FastLanguageModel
from peft import PeftModel
from snac import SNAC

# 1) load the base WITHOUT its bundled adapter, then attach SoroTTS
local = snapshot_download("hypaai/hypaai_orpheus_v5", ignore_patterns=["adapter_*"])
for f in ("adapter_config.json", "adapter_model.safetensors", "adapter_model.bin"):
    p = os.path.join(local, f)
    if os.path.exists(p):
        os.remove(p)

model, tok = FastLanguageModel.from_pretrained(local, max_seq_length=2048, dtype=None, load_in_4bit=False)
model = PeftModel.from_pretrained(model, "Shinzmann/sorotts") # <- the SoroTTS adapter
FastLanguageModel.for_inference(model)
snac = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").to("cuda").eval()

# 2) Orpheus token scheme (must match training)
SOH, EOT, EOH, SOS, EOS, EOAI, OFF = 128259, 128009, 128260, 128257, 128258, 128262, 128266
DEV = next(snac.parameters()).device

def _decode(codes):
    # clamp so a stray code can't crash SNAC
    cl = lambda v: 0 if v < 0 else (4095 if v > 4095 else v)
    l1, l2, l3 = [], [], []
    # Walk whole 7-code frames only, so a partial trailing frame can't raise IndexError.
    for i in range(len(codes) // 7):
        b = 7 * i
        # Slot 0 -> layer 1; slots 1 and 4 -> layer 2; slots 2, 3, 5, 6 -> layer 3.
        l1.append(cl(codes[b])); l2.append(cl(codes[b+1] - 4096))
        l3.append(cl(codes[b+2] - 2*4096)); l3.append(cl(codes[b+3] - 3*4096))
        l2.append(cl(codes[b+4] - 4*4096))
        l3.append(cl(codes[b+5] - 5*4096)); l3.append(cl(codes[b+6] - 6*4096))
    c = [torch.tensor(x).unsqueeze(0).to(DEV) for x in (l1, l2, l3)]
    with torch.inference_mode():
        return snac.decode(c).squeeze().cpu().numpy()

def speak(text, voice="Yor1", max_new_tokens=1024):
    ids = tok(f"{voice}: {text}", return_tensors="pt").input_ids
    ids = torch.cat([torch.tensor([[SOH]]), ids, torch.tensor([[EOT, EOH]])], dim=1).to(model.device)
    out = model.generate(input_ids=ids, attention_mask=torch.ones_like(ids),
                         max_new_tokens=max_new_tokens, do_sample=True, temperature=0.55,
                         top_p=0.95, repetition_penalty=1.1,
                         eos_token_id=[EOS, EOAI], use_cache=True)[0]  # stop on EOS or EOAI
    sos = (out == SOS).nonzero(as_tuple=True)[0]
    seq = out[sos[-1].item() + 1:] if len(sos) else out
    seq = seq[seq >= OFF]  # keep only SNAC audio codes
    n = (seq.size(0) // 7) * 7
    return _decode([t.item() - OFF for t in seq[:n]])

wave = speak("Ẹ kú àbọ̀ sí Nàìjíríà, orílẹ̀-èdè wa tó kún fún ìbùkún.", voice="Yor1")
sf.write("out.wav", wave, 24000)

Two gotchas worth knowing (both handled above):

  1. Stop on EOS or EOAI. The model often ends a clip with EOAI (128262); if you stop only on EOS it will run to the token cap and babble.
  2. Keep only audio codes, and clamp them. Filter to tokens >= 128266, then clamp each SNAC value to [0, 4095]. A stray control token left in the stream corrupts the 7-token frame alignment and crashes the SNAC decoder.
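The second gotcha can be illustrated without a GPU. The sketch below packs toy 7-code frames into Orpheus-style token IDs using the same offsets the Quickstart subtracts, injects a stray control token and a dangling partial frame, and then recovers the original frames. `pack` and `unpack` are hypothetical helper names, and the frame values are made up for illustration.

```python
OFF = 128266   # first audio token ID (from the Quickstart)
EOAI = 128262  # a control token, used here as the stray token

def pack(frames):
    """Flatten 7-code frames into token IDs: slot k is shifted by k*4096, then by OFF."""
    return [OFF + k * 4096 + code for frame in frames for k, code in enumerate(frame)]

def unpack(token_ids):
    """Drop control tokens, keep whole frames, remove offsets, clamp to [0, 4095]."""
    audio = [t - OFF for t in token_ids if t >= OFF]  # filter out non-audio tokens
    audio = audio[: len(audio) // 7 * 7]              # truncate to whole 7-token frames
    clamp = lambda v: max(0, min(4095, v))
    return [[clamp(audio[i + k] - k * 4096) for k in range(7)]
            for i in range(0, len(audio), 7)]

frames = [[1, 2, 3, 4, 5, 6, 7], [4095, 0, 10, 20, 30, 40, 50]]
stream = pack(frames)
stream.insert(7, EOAI)   # stray control token between the two frames
stream.append(OFF + 5)   # dangling partial frame at the end
print(unpack(stream) == frames)  # True
```

Without the filter, the stray token would shift every later code by one slot, so each slot's offset would be subtracted from the wrong value. This sketch returns frames in slot order; the Quickstart's `_decode` additionally regroups the slots into SNAC's three layers.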

Generate one sentence per call, and for longer text stitch the sentences together with a short silence between them. Each call has a budget of roughly 2048 tokens, or about 15 seconds of audio.
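A minimal sketch of the sentence-by-sentence approach, assuming mono float clips at 24 kHz such as those returned by `speak`. The splitter is deliberately naive (it breaks after ., ! or ? followed by whitespace), and `split_sentences`, `stitch`, and the 0.25-second default gap are illustrative choices rather than anything fixed by the model.

```python
import re
import numpy as np

SAMPLE_RATE = 24000  # SNAC output rate used in the Quickstart

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def stitch(clips, gap_seconds=0.25, sample_rate=SAMPLE_RATE):
    """Concatenate mono clips, inserting gap_seconds of silence between them."""
    if not clips:
        return np.zeros(0, dtype=np.float32)
    gap = np.zeros(int(gap_seconds * sample_rate), dtype=np.float32)
    parts = []
    for i, clip in enumerate(clips):
        if i:  # silence goes between clips, not before the first one
            parts.append(gap)
        parts.append(np.asarray(clip, dtype=np.float32))
    return np.concatenate(parts)

# With the Quickstart loaded, usage would look like:
# wave = stitch([speak(s, voice="Yor1") for s in split_sentences(long_text)])
# sf.write("long.wav", wave, SAMPLE_RATE)
```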

Training data

31,574 SNAC-tokenised clips (0.7 to 15 seconds each), streamed and ranked cleanest-first, across:

| Corpus | Languages | License | Role |
|---|---|---|---|
| NaijaVoices | yor / hau / ibo | CC-BY-NC-SA | the bulk (about 600h per language available) |
| WAXAL TTS | yor / hau / ibo | CC-BY | clean single-speaker |
| FLEURS | yor / hau / ibo | CC-BY | read speech |
| BibleTTS | hau | CC-BY-SA | studio 48 kHz (HauBible) |
| Nigerian Pidgin v1.0 | pcm | CC-BY | the Pidgin anchor (the new language) |

Because NaijaVoices (non-commercial) is included, the model is stamped cc-by-nc-sa-4.0. A permissive, commercial variant is possible by training on the CC-BY and CC-BY-SA sources only (the training script has a --commercial-only flag).
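The effect of a commercial-only selection can be sketched as a simple license filter over the table above. This is an illustration of the idea, not the training script's actual implementation of `--commercial-only`; `CORPORA` and `commercial_only` are hypothetical names.

```python
# Corpus -> license, copied from the training-data table.
CORPORA = {
    "NaijaVoices": "CC-BY-NC-SA",
    "WAXAL TTS": "CC-BY",
    "FLEURS": "CC-BY",
    "BibleTTS": "CC-BY-SA",
    "Nigerian Pidgin v1.0": "CC-BY",
}

def commercial_only(corpora):
    """Keep corpora whose license has no NonCommercial (NC) clause."""
    return {name: lic for name, lic in corpora.items() if "NC" not in lic.split("-")}

print(sorted(commercial_only(CORPORA)))
# ['BibleTTS', 'FLEURS', 'Nigerian Pidgin v1.0', 'WAXAL TTS']
```

Note that BibleTTS is share-alike, so a model trained on this subset would still carry that obligation even though commercial use is allowed.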

How it was trained

  • LoRA r=64, α=64, dropout 0, on all attention and MLP projections (the Orpheus-TTS standard), via Unsloth. Only 97.3M parameters are trained, 2.86% of the 3.40B-parameter model.
  • 2 epochs = 7,894 steps, bf16, AdamW-8bit, lr 2e-4, 3% warmup, total batch 8 (8 per device, 1 gradient-accumulation step), on a single NVIDIA B200. The run takes about 31 minutes (1,894 s, roughly 33 samples per second) and converges to a train loss near 3.5.
  • Audio is encoded to Orpheus's 7-tokens-per-frame SNAC stream; each training sequence is [SOH] voice: text [EOT][EOH] [SOAI][SOS] <snac codes> [EOS][EOAI].
  • Fully reproducible on Modal: streaming and SNAC-encoding the data, the LoRA training, and the Hub push all run as one serverless job. See modal/finetune_orpheus.py (train), serving_tts.py (serve), and test_sorotts.py (samples).
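The headline training numbers can be cross-checked with a few lines of arithmetic. The sketch below assumes the backbone uses the published Llama 3.2 3B layer shapes (hidden size 3072, MLP size 8192, 28 layers, key/value projection width 1024) and the usual Llama projection names; neither is stated in this card. Everything else comes from the figures above.

```python
import math

# Figures stated in this card.
CLIPS, EPOCHS, BATCH, RUN_SECONDS, RANK = 31_574, 2, 8, 1_894, 64

# ASSUMPTION: Llama 3.2 3B shapes, not stated in the card.
HIDDEN, INTERMEDIATE, LAYERS, KV_DIM = 3072, 8192, 28, 1024

def lora_params(rank, shapes):
    """Each adapted (d_in, d_out) linear layer gains A (rank x d_in) and B (d_out x rank)."""
    return sum(rank * (d_in + d_out) for d_in, d_out in shapes)

per_layer = [
    (HIDDEN, HIDDEN),        # q_proj
    (HIDDEN, KV_DIM),        # k_proj
    (HIDDEN, KV_DIM),        # v_proj
    (HIDDEN, HIDDEN),        # o_proj
    (HIDDEN, INTERMEDIATE),  # gate_proj
    (HIDDEN, INTERMEDIATE),  # up_proj
    (INTERMEDIATE, HIDDEN),  # down_proj
]

trainable = LAYERS * lora_params(RANK, per_layer)
steps = math.ceil(CLIPS * EPOCHS / BATCH)
throughput = CLIPS * EPOCHS / RUN_SECONDS

print(f"{trainable / 1e6:.1f}M trainable, {steps:,} steps, {throughput:.1f} samples/s")
# 97.3M trainable, 7,894 steps, 33.3 samples/s
```

Under these assumptions the result agrees with the 97.3M trainable parameters, 7,894 steps, and roughly 33 samples per second reported above.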

Intended use and limitations

  • Intended use: accessibility and voice interfaces in Nigerian languages, reading text aloud, IVR, narration. Built for Naija Solar.
  • Speed: Orpheus is autoregressive (roughly 10 to 25 seconds of compute per sentence on an A100 or H200). Great behind a "written-first" UI; for real time, serve via vLLM.
  • English: English speech comes from the base model and was not specifically fine-tuned here.
  • Pidgin has the thinnest data (one clean 4 to 5 hour corpus). It works because Pidgin is English-lexified and Orpheus has a strong English prior, but a dedicated single-speaker corpus would sharpen it.
  • Numbers and units: like most TTS, it reads digits and abbreviations literally. For natural speech, spell numbers as words (Naija Solar does this per language).
  • Ethics: this is a voice for accessibility. Do not use it to clone a real person's voice without consent.

License

CC-BY-NC-SA-4.0, inherited from the NaijaVoices data. The base model and the other corpora carry their own licenses; see each link above. Please attribute SoroTTS and share derivatives alike.

Acknowledgements

Built on Hypa-Orpheus, Orpheus-TTS, SNAC, and Unsloth, with data from NaijaVoices, Google WAXAL and FLEURS, BibleTTS, and the Nigerian Pidgin ASR Corpus.

Citation

@misc{ashinze2026sorotts,
 title = {SoroTTS: a natural voice for Yoruba, Hausa, Igbo, and Nigerian Pidgin},
 author = {Ashinze, Emmanuel},
 year = {2026},
 howpublished = {\url{https://huggingface.co/Shinzmann/sorotts}},
 note = {LoRA fine-tune of Orpheus-3B, built for the Build Small Hackathon}
}

By Ashinze Emmanuel, for the Build Small Hackathon.
