paza-Phi-4-multimodal-instruct

Model Overview

This is a fine-tuned version of microsoft/Phi-4-multimodal-instruct for automatic speech recognition (ASR) in Swahili, Kalenjin, Kikuyu, Luo, Maasai and Somali. The model retains the base model’s transformer-based architecture but is optimized for audio transcription.

Fine-tuning was performed on the full unified multilingual ASR dataset, which covers the six languages listed above, to encourage cross-lingual generalization. During fine-tuning, only the audio-specific components (the audio embedding module, audio encoder, and audio projection layers) were unfrozen and trained; all other model parameters remained frozen to preserve the pretrained language capabilities. Dropout was applied to both the audio encoder and the projection layers to regularize training. The model uses a multimodal processor that handles both text tokenization and audio feature extraction, so audio inputs can be passed directly into the transformer architecture.
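The freezing scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' training code: the name fragments (`audio_embed`, `audio_encoder`, `audio_projection`) and the parameter names in the demo are hypothetical, and the `Param` class is a stand-in so the sketch runs without a deep learning framework. With a PyTorch model, the same function could be given `model.named_parameters()`, since `torch.nn.Parameter` exposes a writable `requires_grad` attribute.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# Hypothetical name fragments for the audio-specific components; the actual
# module names in the checkpoint may differ.
AUDIO_MARKERS = ("audio_embed", "audio_encoder", "audio_projection")


@dataclass
class Param:
    """Stand-in for a framework parameter that carries a requires_grad flag."""
    requires_grad: bool = True


def freeze_all_but_audio(
    named_params: Iterable[Tuple[str, Param]],
    markers: Tuple[str, ...] = AUDIO_MARKERS,
) -> List[str]:
    """Freeze every parameter except those whose name contains a marker.

    Returns the names of the parameters left trainable.
    """
    trainable = []
    for name, param in named_params:
        # A parameter stays trainable only if it belongs to an audio component.
        param.requires_grad = any(marker in name for marker in markers)
        if param.requires_grad:
            trainable.append(name)
    return trainable


# Toy parameter table (names are illustrative only).
params = {
    "model.embed_tokens.weight": Param(),
    "model.layers.0.self_attn.qkv_proj.weight": Param(),
    "model.audio_embed.weight": Param(),
    "model.audio_encoder.layers.0.weight": Param(),
    "model.audio_projection.0.weight": Param(),
    "lm_head.weight": Param(),
}

trainable_names = freeze_all_but_audio(params.items())
print(trainable_names)
```

Selecting parameters by name keeps the text backbone untouched without listing each layer, but it relies on the audio modules following a consistent naming pattern, which should be checked against the actual model before use.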

Alignment approach

Usage

Data Overview

Training Data

The model was fine-tuned on Africa Next Voices Kenya, DigiGreen Kikuyu, a proprietary Kikuyu dataset, and the Swahili split of the Mozilla Common Voice dataset.

Audio samples longer than 30 seconds were excluded, following the recommendations in the input formats documentation.
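The duration filter described above can be sketched as follows. This is an illustrative example only: the `Clip` record, its field names, and the file paths are hypothetical and do not come from the authors' data pipeline. Only the 30-second cutoff is taken from the text.

```python
from dataclasses import dataclass
from typing import Iterable, List

MAX_DURATION_SECONDS = 30.0  # cutoff stated in the text above


@dataclass
class Clip:
    """Hypothetical record describing one audio sample."""
    path: str
    num_samples: int
    sample_rate: int

    @property
    def duration_seconds(self) -> float:
        # Duration is the number of samples divided by samples per second.
        return self.num_samples / self.sample_rate


def filter_by_duration(
    clips: Iterable[Clip], max_seconds: float = MAX_DURATION_SECONDS
) -> List[Clip]:
    """Keep only clips whose duration does not exceed max_seconds."""
    return [clip for clip in clips if clip.duration_seconds <= max_seconds]


# Illustrative clips at a 16 kHz sample rate (paths are made up).
clips = [
    Clip("clip_a.wav", num_samples=160_000, sample_rate=16_000),  # 10 s
    Clip("clip_b.wav", num_samples=480_000, sample_rate=16_000),  # exactly 30 s
    Clip("clip_c.wav", num_samples=480_001, sample_rate=16_000),  # just over 30 s
]

kept = filter_by_duration(clips)
print([clip.path for clip in kept])
```

The sketch keeps clips of exactly 30 seconds, since the text excludes only samples longer than that.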

Data distribution by language

Figure 1: Data distribution by language

Training Procedure

Training Hyperparameters

Quality and performance evaluation

The results below are computed on the test splits of all datasets shown in the data distribution chart, as of December 8, 2025. Because the training data is imbalanced across languages (see the data distribution chart), gains correlate with the amount of training data available per language.

The fine-tuned model improves substantially on the base model in both Word Error Rate (WER) and Character Error Rate (CER). It outperforms the base model consistently across languages, with the size of the improvement varying according to the underlying language distribution.
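For readers unfamiliar with the two metrics, the sketch below computes WER and CER from the Levenshtein edit distance between a reference and a hypothesis transcript. It is a generic illustration, not the authors' evaluation script: the text does not state which tool or text normalization was used for the reported numbers, and the example sentences are made up.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]  # turning i reference items into an empty hypothesis
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Made-up example: one wrong word out of three, one wrong character out of 17.
reference = "habari ya asubuhi"
hypothesis = "habari za asubuhi"
print(f"WER = {wer(reference, hypothesis):.3f}")
print(f"CER = {cer(reference, hypothesis):.3f}")
```

Because a single wrong character marks a whole word as incorrect, WER is usually higher than CER for the same transcript, which is why the two metrics are reported side by side.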

Note: The Kikuyu evaluation results are computed using the test splits of all Kikuyu datasets listed above, including the proprietary dataset.

Character Error Rate Comparison Across languages

Figure 2: Character Error Rate (CER) comparison across the six languages for the base model versus the fine-tuned model. Lower CER indicates better transcription performance.

Word Error Rate Comparison Across languages

Figure 3: Word Error Rate (WER) comparison across the six languages for the base model versus the fine-tuned model. Lower WER indicates better transcription performance.

Comparison Across SOTA models

We benchmarked our fine-tuned models against three state-of-the-art models: Meta's facebook/omniASR-LLM-7B and facebook/mms-1b-all, and OpenAI's openai/whisper-large-v3-turbo. This set provides a balanced comparison across large-scale multilingual, low-resource, and leading ASR models.

Figure 4: Character Error Rate (CER) comparison across the Kenyan languages for several state‑of‑the‑art ASR models including the Paza models. Lower CER indicates better transcription performance.

Figure 5: Word Error Rate (WER) comparison across the Kenyan languages for several state‑of‑the‑art ASR models including the Paza models. Lower WER indicates better transcription performance.

Technical requirements and integration guidance

Responsible AI considerations

Long Context

Safety evaluation and red-teaming

Contact

Requests for additional information may be directed to MSFTAIActRequest@microsoft.com.

Authorized representative: Microsoft Ireland Operations Limited, 70 Sir John Rogerson’s Quay, Dublin 2, D02 R296, Ireland


Model size: 6B params (Safetensors, tensor type F32)


Collection including microsoft/paza-Phi-4-multimodal-instruct

Paza is a collection of speech models and benchmarks for low-resource languages by the Microsoft Research Africa - Nairobi Lab.

URL: https://huggingface.co/microsoft/paza-Phi-4-multimodal-instruct
