OmniVoice Amharic: Open Voice AI for 60M Speakers
Part of Voices For All, an open initiative to build speech AI for every language, starting with those left behind by Big Tech.
This is, to our knowledge, the highest-quality open Amharic TTS model available today. It generates natural, expressive speech from text and can clone a speaker's voice zero-shot from a 10-second audio sample.
Quick Try (No Install)
Live Demo: Try it in your browser
At a Glance
| Property | Value |
|---|---|
| Languages | Amharic (primary), English, Chinese (base model) |
| Architecture | Non-autoregressive discrete diffusion |
| Parameters | 612.6M (Qwen3-0.6B + HiggsAudioV2, 8 codebooks) |
| Training data | ~81,731 samples / ~331 hours |
| Best loss | 3.9518 (step 10,000 / 12,000) |
| License | Apache 2.0 |
| Inference cost | Runs on free Google Colab T4 (~3GB VRAM) |
| Voice cloning | Zero-shot, 10s reference audio |
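As a rough sanity check on the VRAM figure, the sketch below estimates the memory taken by the weights alone in float16. It assumes the 612.6M count covers everything loaded onto the GPU; the remainder of the ~3GB would then go to activations, the CUDA context, and intermediate buffers, which this estimate does not model.

```python
PARAMS = 612.6e6          # parameter count from the table above
BYTES_PER_PARAM_FP16 = 2  # float16 stores each weight in 2 bytes


def weight_memory_gib(n_params: float, bytes_per_param: int) -> float:
    """Return the memory needed for the weights alone, in GiB."""
    return n_params * bytes_per_param / 1024**3


fp16_gib = weight_memory_gib(PARAMS, BYTES_PER_PARAM_FP16)
print(f"fp16 weights: {fp16_gib:.2f} GiB")
```

The weights come to a little over 1 GiB, which is consistent with the model fitting comfortably on a 16GB T4.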
What Makes This Special
1. Actually Sounds Like Amharic
Most "multilingual" TTS models (MMS, XTTS) produce Amharic that sounds robotic or mispronounces ejective consonants (ጠ, ጰ, ጸ, ፀ, ቀ, ጨ). This model was fine-tuned exclusively on Amharic audio and preserves:
- Correct ejective / glottalic consonant articulation
- Natural prosody and rhythm (not English rhythm overlaid on Amharic words)
- Gemination (double consonants: แแ แฐ vs แแฅแด)
- Pitch patterns for questions vs statements
2. Voice Cloning Works
Give it 10 seconds of audio from any Amharic speaker and it will synthesize new sentences in that voice. Tested on:
- Male/female voices
- Formal news-reading style
- Casual conversational style
- Different Ethiopian dialects (Addis Ababa, Gondar, Wollo)
3. Open Everything
- Open weights (Apache 2.0)
- Open training code
- Open datasets (or documented sources)
- Open benchmarks (MOS scores will be published as evaluation completes)
- No API keys, no cloud lock-in
Quick Start: Colab
# Cell 1: Install
!pip install -q omnivoice soundfile

# Cell 2: Load model
import torch
from omnivoice import OmniVoice, OmniVoiceGenerationConfig

model = OmniVoice.from_pretrained(
    "african-low-resource/omnivoice-amharic",
    device_map="cuda:0",
    dtype=torch.float16,
)

# Cell 3: Generate speech
import soundfile as sf

# "Hello, welcome. This is an Amharic speech test."
text = "ሰላም፣ እንኳን ደህና መጣችሁ። ይህ የአማርኛ ንግግር ሙከራ ነው።"
audio = model.generate(
    text=text,
    language="Amharic",
    generation_config=OmniVoiceGenerationConfig(num_step=32, guidance_scale=2.0),
)

# Write the first generated waveform at the 24 kHz output rate
sf.write("output.wav", audio[0], 24000)
print("✅ Saved to output.wav")
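For longer passages, it can help to synthesize one sentence at a time. The sketch below splits text on Amharic and ASCII sentence-ending punctuation; `split_sentences` is a hypothetical helper, not part of the `omnivoice` package, and whether chunking improves quality for this model is an assumption rather than a tested result.

```python
import re

# Split right after an Ethiopic full stop (።), Ethiopic question mark (፧),
# or an ASCII ? or !, consuming any whitespace that follows.
_SENTENCE_END = re.compile(r"(?<=[።፧?!])\s*")


def split_sentences(text: str) -> list[str]:
    """Return the non-empty sentences in `text`, punctuation included."""
    return [part.strip() for part in _SENTENCE_END.split(text) if part.strip()]


sample = "ሰላም፣ እንኳን ደህና መጣችሁ። ይህ የአማርኛ ንግግር ሙከራ ነው።"
for chunk in split_sentences(sample):
    print(chunk)
```

Each chunk could then be passed to `model.generate` as in Cell 3 and the resulting waveforms concatenated before writing the file.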
Voice Cloning
# Upload a 10-second reference WAV (here: speaker.wav) before running this cell
prompt = model.create_voice_clone_prompt(ref_audio="speaker.wav", ref_text=None)

# "The day is good today."
audio = model.generate(
    text="ዛሬ ቀኑ ጥሩ ነው።",
    language="Amharic",
    voice_clone_prompt=prompt,
    generation_config=OmniVoiceGenerationConfig(num_step=32, guidance_scale=2.0),
)
sf.write("cloned.wav", audio[0], 24000)
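Before cloning, it may be worth checking that the reference clip is close to the recommended 10 seconds. The sketch below uses only the standard-library `wave` module, so it handles uncompressed PCM WAV files only; `check_reference` and its tolerance are hypothetical choices for illustration, not limits enforced by the model.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def check_reference(path: str, target: float = 10.0, tolerance: float = 5.0):
    """Return (ok, duration); ok is True if the clip is within tolerance of target.

    The 10 s target follows the recommendation above; the 5 s tolerance
    is an arbitrary illustrative value.
    """
    duration = wav_duration_seconds(path)
    return abs(duration - target) <= tolerance, duration
```

Calling `check_reference("speaker.wav")` before `create_voice_clone_prompt` gives an early warning when a clip is much shorter or longer than intended.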
Training Details
| Parameter | Value |
|---|---|
| Base model | k2-fsa/OmniVoice |
| Backbone | Qwen3-0.6B (636M params) |
| Audio tokenizer | HiggsAudioV2 (8 codebooks, 1025 vocab) |
| Learning rate | 2e-5 |
| LR schedule | Cosine |
| Max steps | 12,000 |
| Epochs | ~10 |
| Batch tokens | 28,672 |
| Precision | bf16 |
| Codebook weights | [8, 8, 6, 6, 4, 4, 2, 2] |
| Best loss | 3.9518 @ step 10,000 |
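To make two of the table entries concrete, the sketch below shows a plain cosine learning-rate schedule and a weighted average of per-codebook losses. Both are illustrative assumptions: the schedule here has no warmup and decays to zero, and the loss is normalized by the sum of the weights. The actual training code may differ on all of these points.

```python
import math

PEAK_LR = 2e-5                               # learning rate from the table
MAX_STEPS = 12_000                           # max steps from the table
CODEBOOK_WEIGHTS = [8, 8, 6, 6, 4, 4, 2, 2]  # codebook weights from the table


def cosine_lr(step: int, peak_lr: float = PEAK_LR,
              max_steps: int = MAX_STEPS, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr at step 0 to min_lr at max_steps (no warmup)."""
    progress = min(max(step, 0), max_steps) / max_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def weighted_codebook_loss(per_codebook_losses, weights=CODEBOOK_WEIGHTS) -> float:
    """Weighted mean of per-codebook losses (normalization is an assumption)."""
    if len(per_codebook_losses) != len(weights):
        raise ValueError("expected one loss value per codebook")
    total = sum(w * loss for w, loss in zip(weights, per_codebook_losses))
    return total / sum(weights)


for step in (0, 6_000, 10_000, 12_000):
    print(step, f"{cosine_lr(step):.2e}")
```

Under this weighting, the first two codebooks contribute four times as much to the loss as the last two, which would prioritize the coarse acoustic structure over fine detail.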
Datasets
| Dataset | Hours | Role |
|---|---|---|
| google/WaxalNLP | ~200h | Core corpus |
| gheero-Leyu/leyu-amharic-addis-ababa-dialect | ~50h | Dialect diversity |
| surafelabebe/amharic_clear_audio_tts | ~40h | Clean TTS data |
| chappM/amharic-bdu-asr | ~41h | ASR-aligned quality |
| Total | ~331h | |
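The totals above can be cross-checked with a few lines of arithmetic. The sketch below sums the approximate per-dataset hours and, assuming the ~81,731 samples reported for the model cover the same 331 hours, derives an average clip length of roughly 14.6 seconds.

```python
# Approximate hours per dataset, copied from the table above
DATASET_HOURS = {
    "google/WaxalNLP": 200,
    "gheero-Leyu/leyu-amharic-addis-ababa-dialect": 50,
    "surafelabebe/amharic_clear_audio_tts": 40,
    "chappM/amharic-bdu-asr": 41,
}
TOTAL_SAMPLES = 81_731  # approximate sample count reported for the model

total_hours = sum(DATASET_HOURS.values())
avg_clip_seconds = total_hours * 3600 / TOTAL_SAMPLES

print(f"total: {total_hours} h, average clip: {avg_clip_seconds:.1f} s")
```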
Training History
| Run | Steps | Best Loss | Notes |
|---|---|---|---|
| 1 | 0–1,500 | ~4.15 | Init from v3 |
| 2 | 1,500–6,000 | 3.9994 (step 4,190) | Checkpoints lost to a storage issue |
| 3 | 2,700–12,000 | 3.9518 (step 10,000) | Final best |
Evaluation
We evaluate on a held-out test set (10% of combined data, never seen in training).
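One reproducible way to carve out a 10% held-out set is to hash each sample ID into a bucket, so the assignment never changes between runs. This is a sketch of that general technique, not the procedure used for this model; `is_test_sample` and the ID format are hypothetical.

```python
import hashlib


def is_test_sample(sample_id: str, test_fraction: float = 0.10) -> bool:
    """Deterministically assign a sample to the test set based on its ID."""
    digest = hashlib.sha256(sample_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < test_fraction


ids = [f"clip-{i}" for i in range(10_000)]  # placeholder IDs
test_ids = [sid for sid in ids if is_test_sample(sid)]
print(f"{len(test_ids) / len(ids):.1%} of samples held out")
```

Hashing a speaker ID instead of a clip ID would keep every speaker entirely on one side of the split, which matters when evaluating voice cloning on unseen speakers.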
Objective Metrics
| Metric | Value | Comparison (MMS-TTS-amh) |
|---|---|---|
| Mel-Cepstral Distortion (MCD) | TBD | TBD |
| F0 RMSE | TBD | TBD |
| Character Error Rate (ASR-back) | TBD | TBD |
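For reference, character error rate is the character-level edit distance between the ASR transcript of the synthesized audio and the input text, divided by the length of the input text. The sketch below is a minimal pure-Python version; it does no text normalization, and the choice of ASR system is left open. Each Ethiopic syllable character is a single code point, so counting code points is a reasonable unit for Amharic.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of `hypothesis` against a non-empty `reference`."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Note that CER can exceed 1.0 when the transcript contains many inserted characters.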
Subjective Metrics (MOS)
| Criterion | Score (1-5) | N evaluators |
|---|---|---|
| Naturalness | TBD | TBD |
| Speaker similarity (cloning) | TBD | TBD |
| Ejective consonant accuracy | TBD | TBD |
| Prosody / rhythm | TBD | TBD |
Subjective evaluation in progress. Results will be published here and in our benchmark repo.
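When the ratings are in, each MOS criterion is typically reported as a mean with a confidence interval. The sketch below uses a normal approximation for the 95% interval; the ratings shown are placeholders to exercise the function, not evaluation results.

```python
import math
import statistics


def mos_with_ci(ratings, z: float = 1.96):
    """Return (mean, half_width) of an approximate 95% confidence interval."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.fmean(ratings)
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, z * sem


placeholder_ratings = [4, 5, 3, 4, 4]  # made-up values, NOT real results
mean, half_width = mos_with_ci(placeholder_ratings)
print(f"MOS = {mean:.2f} ± {half_width:.2f}")
```

With few evaluators, a t-distribution interval would be wider and more appropriate than the normal approximation used here.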
Roadmap
This model is Phase 1 of a larger pan-African initiative:
- Amharic (East Africa, 60M speakers): TTS + voice cloning (done)
- Wolof (West Africa, 12M speakers): TTS + voice cloning (Q3 2026)
- Hausa (West Africa, 90M speakers): TTS (Q4 2026)
- Swahili (East Africa, 200M speakers): TTS + ASR (Q1 2027)
- Somali (Horn of Africa, 20M speakers): TTS (Q2 2027)
- Self-service fine-tuning toolkit for any language with 50h+ audio
Follow Voices For All for updates.
Limitations & Biases
- Gender representation: Training data skews male (65%). Female voices may sound less natural.
- Dialect coverage: Heavy Addis Ababa bias. Rural Ethiopian accents (Tigray, Harar, Sidama) are underrepresented.
- Code-mixing: Output is unpredictable when the text switches between Amharic and English mid-sentence.
- Numerals/dates: Amharic calendar dates and large numbers are sometimes mispronounced.
- Emotional range: Neutral/news-reading style only. No whisper, shouting, or singing.
We actively seek more diverse training data. If you have Amharic audio recordings (any dialect, any speaker), contact us.
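One way to guard against the code-mixing limitation listed above is to flag inputs that contain both Ethiopic and Latin letters before synthesis. The sketch below checks Unicode code point ranges; `is_code_mixed` is a hypothetical helper, and it treats only ASCII letters as Latin.

```python
def script_of(ch: str) -> str:
    """Classify a single character as 'ethiopic', 'latin', or 'other'."""
    cp = ord(ch)
    # Ethiopic and Ethiopic Supplement (U+1200-U+139F), Ethiopic Extended (U+2D80-U+2DDF)
    if 0x1200 <= cp <= 0x139F or 0x2D80 <= cp <= 0x2DDF:
        return "ethiopic"
    if ch.isascii() and ch.isalpha():
        return "latin"
    return "other"


def is_code_mixed(text: str) -> bool:
    """Return True if the text contains both Ethiopic and Latin letters."""
    scripts = {script_of(ch) for ch in text}
    return {"ethiopic", "latin"} <= scripts


print(is_code_mixed("ሰላም hello"))
print(is_code_mixed("ሰላም፣ እንኳን ደህና መጣችሁ።"))
```

Flagged inputs could be split by script and synthesized separately, or simply surfaced to the user as likely to produce unstable output.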
Citation
@software{omnivoice_amharic_2026,
author = {demeleww and Voices For All},
title = {OmniVoice Amharic: Open Voice AI for 60M Speakers},
year = {2026},
url = {https://huggingface.co/african-low-resource/omnivoice-amharic},
license = {Apache-2.0}
}
Base model:
@article{omnivoice2026,
title={OmniVoice: High-Quality Voice Cloning TTS for 600+ Languages},
journal={arXiv preprint arXiv:2604.00688},
year={2026}
}
Contact
- Organization: Voices For All
- Lead: demeleww
- Issues: Open a GitHub issue
- Collaboration: sowwen0@gmail.com
Built with ❤️ for the 60M+ Amharic speakers who deserve a voice in AI.
