paza-whisper-large-v3-turbo
Model Details
This model is a version of openai/whisper-large-v3-turbo fine-tuned for automatic speech recognition (ASR) in several Kenyan languages, including Swahili, Kalenjin, Kikuyu, Luo, Maasai and Somali. Whisper is a transformer-based encoder-decoder model that converts raw audio into text. The encoder processes audio inputs as log-Mel spectrograms, capturing acoustic and linguistic features, while the decoder generates text tokens autoregressively. This design allows the model to handle diverse languages, accents, and noise conditions with strong generalization.
Fine-tuning was performed on the entire unified multilingual ASR dataset, which covers the six languages listed above, to encourage cross-lingual generalization. It consisted of continued supervised training on labeled audio-text pairs, updating all of the model’s parameters to better capture the phonetic and linguistic patterns of these languages. As a result, this model provides improved transcription accuracy for low-resource speech recognition tasks while maintaining Whisper’s robustness and efficiency.
Alignment approach
Usage
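A minimal sketch of how a fine-tuned Whisper checkpoint such as this one could be loaded for transcription, assuming the Hugging Face transformers library is installed and ffmpeg is available for decoding audio files. The function name transcribe and its parameters are illustrative, not part of this model card, and the full Hub repository path for this model is not stated here, so it is left as an argument for the caller to supply.

```python
def transcribe(audio_path: str, model_id: str, chunk_length_s: float = 30.0) -> str:
    """Transcribe one audio file with a Whisper-style checkpoint.

    audio_path: path to a local audio file (hypothetical example input).
    model_id: Hub repository path or local directory of the checkpoint;
              the caller must supply this, it is not given in the card.
    chunk_length_s: long recordings are split into chunks of this length,
                    since Whisper processes audio in 30-second windows.
    """
    # Imported lazily so this module can be loaded without transformers installed.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=chunk_length_s,
    )
    result = asr(audio_path)
    # The pipeline returns a dict whose "text" field holds the transcript.
    return result["text"]
```

Calling transcribe("sample.wav", model_id) with a valid checkpoint path would return the transcript as a string; the first call downloads the model weights, so it can take some time.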
Data overview
Training Data
The model was fine-tuned on the Africa Next Voices Kenya dataset, the DigiGreen Kikuyu dataset, a proprietary Kikuyu dataset, and the Swahili split of the Mozilla Common Voice dataset.
Because the Whisper decoder accepts at most 448 tokens, samples whose tokenized transcripts exceeded this limit were discarded during tokenization. Performance in each language correlates strongly with the amount of training data available for that language.
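The length filter described above can be sketched as follows. This is an illustration only, assuming the 448-token limit is applied to the tokenized transcript of each sample; the function filter_by_label_length, the (audio_id, transcript) sample layout, and the whitespace tokenizer are hypothetical stand-ins, and in practice the Whisper tokenizer would be passed in instead.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

MAX_LABEL_TOKENS = 448  # limit stated in this model card

Sample = Tuple[str, str]  # (audio_id, transcript): a hypothetical layout


def filter_by_label_length(
    samples: Iterable[Sample],
    tokenize: Callable[[str], Sequence],
    max_tokens: int = MAX_LABEL_TOKENS,
) -> List[Sample]:
    """Keep only samples whose tokenized transcript fits within max_tokens."""
    kept = []
    for audio_id, transcript in samples:
        if len(tokenize(transcript)) <= max_tokens:
            kept.append((audio_id, transcript))
    return kept


def whitespace_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer for the example; not the Whisper tokenizer."""
    return text.split()


if __name__ == "__main__":
    data = [("short", "habari ya asubuhi"), ("long", "neno " * 500)]
    print(filter_by_label_length(data, whitespace_tokenize))
```

With the stand-in tokenizer, the 500-word sample is dropped and the three-word sample is kept. A subword tokenizer usually produces more tokens than words, so the real filter would discard somewhat shorter transcripts than this example suggests.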
Data distribution by language
Figure 1: Data distribution by language
Training Procedure
Training Hyperparameters
Quality and performance evaluation
These are the results from the test splits of all the datasets mentioned in the data distribution chart as of December 08, 2025.
Because the training data is imbalanced across languages (see the data distribution chart), gains correlate with data volume.
The fine-tuned model shows lower Word Error Rate (WER) and Character Error Rate (CER) than the base model across languages, with the size of the improvement varying according to the underlying language distribution.
Note: The Kikuyu evaluation results are computed using the test splits of all Kikuyu datasets listed above, including the proprietary dataset.
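For readers unfamiliar with the two metrics, the sketch below shows the standard definitions: the edit distance between reference and hypothesis, divided by the reference length, counted in words for WER and in characters for CER. It is a plain-Python illustration, not the evaluation code behind the reported results, and it applies no text normalization because this card does not say which normalization, if any, was used.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    ref, hyp = "habari ya asubuhi", "habari za asubuhi"
    print(f"WER={wer(ref, hyp):.3f}  CER={cer(ref, hyp):.3f}")
```

In this example one of three words is wrong, giving a WER of 1/3, while only one of 17 characters differs, giving a CER of 1/17. This gap is why CER is often reported alongside WER for morphologically rich languages, where a single wrong affix makes the whole word count as an error.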
Character Error Rate Comparison Across languages
Figure 3: Character Error Rate (CER) comparison across the six languages for the base model versus the finetuned model. Lower CER indicates better transcription performance.
Word Error Rate Comparison Across languages
Figure 2: Word Error Rate (WER) comparison across the six languages for the base model versus the fine-tuned model. Lower WER indicates better transcription performance.
Comparison Across SOTA models
We benchmarked our fine-tuned models against three state-of-the-art models: Meta’s facebook/omniASR-LLM-7B and facebook/mms-1b-all, and OpenAI's openai/whisper-large-v3-turbo. This set provides a balanced comparison across large-scale multilingual, low-resource, and leading ASR models.
Figure 4: Character Error Rate (CER) comparison across the Kenyan languages for several state‑of‑the‑art ASR models including the Paza models. Lower CER indicates better transcription performance.
Figure 5: Word Error Rate (WER) comparison across the Kenyan languages for several state-of-the-art ASR models including the Paza models. Lower WER indicates better transcription performance.
Technical requirements and integration guidance
Responsible AI considerations
Long context
Safety evaluation and red-teaming
Contact
Requests for additional information may be directed to
Authorized representative: Microsoft Ireland Operations Limited 70 Sir John Rogerson’s Quay, Dublin 2, D02 R296, Ireland
