paza-Phi-4-multimodal-instruct
Model Overview
This is a fine-tuned version of microsoft/Phi-4-multimodal-instruct for automatic speech recognition (ASR) in Swahili, Kalenjin, Kikuyu, Luo, Maasai and Somali. The model retains the base model’s transformer-based architecture but is optimized for audio transcription.
Fine-tuning used the full unified multilingual ASR dataset, covering all six languages, to encourage cross-lingual generalization. During fine-tuning, only the audio-specific components (the audio embedding module, the audio encoder, and the audio projection layers) were unfrozen and trained; all other model parameters remained frozen to preserve the pretrained language capabilities. Dropout was applied to both the audio encoder and the projection layers to regularize training. The model uses a multimodal processor that handles text tokenization and audio feature extraction, so audio inputs feed directly into the transformer architecture.
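The selective unfreezing described above can be sketched as a name-based filter over the model's parameters. This is a minimal illustration, not the training code used for this model: the helper `unfreeze_only` and the keyword `"audio_embed"` are assumptions, and the keyword must be checked against the parameter names of the checkpoint you actually load. Dropout configuration is not covered here.

```python
def unfreeze_only(model, trainable_keywords):
    """Freeze every parameter except those whose name contains a keyword.

    Works with any object exposing ``named_parameters()`` that yields
    (name, parameter) pairs, such as a ``torch.nn.Module``.
    Returns the names of the parameters left trainable.
    """
    trainable = []
    for name, param in model.named_parameters():
        # A parameter stays trainable only if its name matches a keyword.
        param.requires_grad = any(key in name for key in trainable_keywords)
        if param.requires_grad:
            trainable.append(name)
    return trainable


# Hypothetical keyword: verify it against the names yielded by
# model.named_parameters() before relying on it.
AUDIO_KEYWORDS = ("audio_embed",)
```

Calling `unfreeze_only(model, AUDIO_KEYWORDS)` on a loaded model and inspecting the returned list is a quick way to confirm that only the audio embedding, encoder, and projection parameters are being updated.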
Alignment approach
Usage
Data Overview
Training Data
The model was fine-tuned on four sources: Africa Next Voices Kenya, DigiGreen Kikuyu, a proprietary Kikuyu dataset, and the Swahili split of the Mozilla Common Voice dataset.
Audio samples longer than 30 seconds were excluded, following the recommendations in the input formats documentation.
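The 30-second cutoff can be applied with a simple duration filter, sketched below. The record layout (dicts with `num_samples` and `sample_rate` keys) and the function names are hypothetical and chosen only for illustration; the actual preprocessing pipeline is not described in this card.

```python
MAX_DURATION_SECONDS = 30.0


def duration_seconds(num_samples, sample_rate):
    """Clip length in seconds from its sample count and sampling rate."""
    return num_samples / sample_rate


def keep_short_clips(clips, max_seconds=MAX_DURATION_SECONDS):
    """Drop clips longer than max_seconds.

    Each clip is a dict with hypothetical 'num_samples' and
    'sample_rate' keys. Clips of exactly max_seconds are kept.
    """
    return [
        clip
        for clip in clips
        if duration_seconds(clip["num_samples"], clip["sample_rate"]) <= max_seconds
    ]
```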
Data distribution by language
Figure 1: Data distribution by language
Training Procedure
Training Hyperparameters
Quality and performance evaluation
These are the results from the test splits of all the datasets mentioned in the data distribution chart as of December 08, 2025.
Because the training data is imbalanced across languages (see the data distribution chart), gains correlate with data volume.
Compared to the base model, the fine-tuned model significantly improves both Word Error Rate (WER) and Character Error Rate (CER) and consistently outperforms the base across languages. The size of the improvement varies by language, reflecting the underlying language distribution.
Note: The Kikuyu evaluation results are computed using the test splits of all Kikuyu datasets listed above, including the proprietary dataset.
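For readers unfamiliar with the two metrics, the sketch below computes per-utterance WER and CER as edit distance divided by reference length. It is an illustration only: the text normalization (casing, punctuation) and the aggregation used to produce the reported results are not stated in this card, so this code applies none and may not reproduce the reported figures.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Because both metrics divide by the reference length, either can exceed 1.0 when the hypothesis contains many insertions.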
Character Error Rate Comparison Across languages
Figure 2: Character Error Rate (CER) comparison across the six languages for the base model versus the fine-tuned model. Lower CER indicates better transcription performance.
Word Error Rate Comparison Across languages
Figure 3: Word Error Rate (WER) comparison across the six languages for the base model versus the fine-tuned model. Lower WER indicates better transcription performance.
Comparison Across SOTA models
We benchmarked our fine-tuned models against three state-of-the-art models: Meta's facebook/omniASR-LLM-7B and facebook/mms-1b-all, and OpenAI's openai/whisper-large-v3-turbo. This set provides a balanced comparison across large-scale multilingual, low-resource, and leading ASR models.
Figure 4: Character Error Rate (CER) comparison across the Kenyan languages for several state‑of‑the‑art ASR models including the Paza models. Lower CER indicates better transcription performance.
Figure 5: Word Error Rate (WER) comparison across the Kenyan languages for several state‑of‑the‑art ASR models including the Paza models. Lower WER indicates better transcription performance.
Technical requirements and integration guidance
Responsible AI considerations
Long Context
Safety evaluation and red-teaming
Contact
Requests for additional information may be directed to
Authorized representative: Microsoft Ireland Operations Limited 70 Sir John Rogerson’s Quay, Dublin 2, D02 R296, Ireland