Phi-4 Multimodal ASR (Galician Fine-Tuned)
This model is a fine-tuned version of microsoft/Phi-4-multimodal-instruct adapted for automatic speech recognition (ASR) in Galician (gl).
It uses Phi-4’s multimodal capabilities to take audio embeddings as input and generate text transcriptions autoregressively, offering an alternative to traditional encoder-decoder speech models.
The system is designed to deliver robust transcription quality across multiple Galician speech domains, including read speech, interviews, conversational speech, and broadcast media.
Model Architecture
The base model is Phi-4 Multimodal, a decoder-only multimodal transformer capable of handling text, audio, and image inputs.
For ASR:
- Audio is converted into learned audio embeddings
- These embeddings are inserted into the prompt at the position of a special audio placeholder token (<|audio_1|>)
- The model generates the transcription autoregressively as text
Only the audio embedding layers are fine-tuned, while the rest of the model remains frozen.
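The selective fine-tuning described above can be sketched as follows. This is a minimal illustration, not the code from the training script: it assumes the trainable parameters can be identified by the substring audio_embed in their names, and the helper freeze_except and the demo parameter names are hypothetical. The function accepts any iterable of (name, parameter) pairs whose parameters expose a requires_grad attribute, which is the shape PyTorch's model.named_parameters() returns.

```python
from types import SimpleNamespace


def freeze_except(named_parameters, keyword="audio_embed"):
    """Enable gradients only for parameters whose name contains `keyword`.

    Returns the list of parameter names left trainable.
    """
    trainable = []
    for name, param in named_parameters:
        param.requires_grad = keyword in name
        if param.requires_grad:
            trainable.append(name)
    return trainable


# Stand-in parameters so the sketch runs without PyTorch. The names are
# illustrative assumptions, not taken from the real checkpoint.
demo_params = [
    ("model.embed_tokens_extend.audio_embed.encoder.weight", SimpleNamespace(requires_grad=True)),
    ("model.layers.0.self_attn.qkv_proj.weight", SimpleNamespace(requires_grad=True)),
    ("lm_head.weight", SimpleNamespace(requires_grad=True)),
]

trainable_names = freeze_except(demo_params)
print(trainable_names)
```

With a loaded model, the equivalent call would be freeze_except(model.named_parameters()), after which only the matching parameters receive gradient updates.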
Training Data
The model was trained and evaluated on a multi-corpus Galician ASR dataset composed of public and curated speech resources.
All datasets were standardized into a homogeneous format and mounted inside the training container.
Datasets Included
- Common Voice v23 (Galician) – volunteer read speech
- FLEURS GL-EN – short TTS-style utterances
- Transcrispeech (Galician) – radio, television and interviews
- FalAI – mixed corpus including read and conversational speech
- OpenSLR Galician – controlled read speech
These datasets provide coverage across clean read speech, semi-spontaneous speech, and more challenging real-world conditions.
Training Procedure
Fine-tuning was performed using a custom training pipeline implemented in finetune_phi4_gl_asr.py.
Training Strategy
- Only parameters related to audio embeddings (audio_embed) were updated
- All other model weights remained frozen
- Supervised training using target Galician transcripts
- A fixed prompt format was used:
<|audio_1|> Transcribe the audio clip into Galician text.
- Target sequences were terminated with:
<|end|><|endoftext|>
- Periodic evaluation every 20% of total training progress
- Custom Trainer modifications to prevent out-of-memory (OOM) errors
- Checkpoints saved every 1000 training steps
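As a rough sketch of how the fixed prompt, the target terminator, and the 20% evaluation schedule fit together: the helper names build_example and eval_steps are hypothetical, the sample transcript is made up, and the whitespace stripping is an assumption. Whether the training script wraps the prompt in a chat template is not stated here, so the sketch uses the prompt string exactly as given.

```python
PROMPT = "<|audio_1|> Transcribe the audio clip into Galician text."
TARGET_SUFFIX = "<|end|><|endoftext|>"


def build_example(transcript):
    """Return the (prompt, target) text pair for one training utterance."""
    # Stripping surrounding whitespace is an assumption of this sketch.
    return PROMPT, transcript.strip() + TARGET_SUFFIX


def eval_steps(total_steps, fraction=0.2):
    """Number of steps between evaluations when evaluating every `fraction` of training."""
    return max(1, round(total_steps * fraction))


prompt, target = build_example("  O can corre polo parque. ")
print(prompt)
print(target)
print(eval_steps(5000))
```

The model is trained to produce the target text given the prompt and the audio embeddings, so the terminator tokens teach it where a transcription ends.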
Hyperparameters
- Epochs: 3
- Learning rate: 5e-5
- Batch size per GPU: 1
- Gradient accumulation: 16
- Precision: FP16
- Gradient checkpointing: Enabled
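The batch size and gradient accumulation settings combine into an effective batch size per optimizer update. The arithmetic below is a sketch under stated assumptions: the number of GPUs and the training set size are not given in this card, so num_gpus=1 and num_examples=10000 are placeholder values, and the step count is an approximation of what a trainer would report.

```python
import math

PER_GPU_BATCH = 1   # from the hyperparameter list
GRAD_ACCUM = 16     # from the hyperparameter list
EPOCHS = 3          # from the hyperparameter list


def effective_batch_size(per_gpu_batch, grad_accum_steps, num_gpus):
    """Examples contributing to one optimizer update."""
    return per_gpu_batch * grad_accum_steps * num_gpus


def optimizer_steps(num_examples, epochs, batch_size):
    """Approximate number of optimizer updates over the whole run."""
    return math.ceil(num_examples / batch_size) * epochs


# num_gpus and num_examples are assumed values for illustration only.
batch = effective_batch_size(PER_GPU_BATCH, GRAD_ACCUM, num_gpus=1)
steps = optimizer_steps(num_examples=10000, epochs=EPOCHS, batch_size=batch)
print(batch, steps)
```

Accumulating gradients over 16 micro-batches of size 1 keeps peak memory close to that of a single example while still averaging the update over 16 examples per GPU.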
Evaluation
Evaluation was conducted using the script inference.py.
The evaluation pipeline:
- Loads the fine-tuned model from /workspace/model
- Loads datasets from /mnt/datasets
- Generates transcriptions using model.generate()
- Applies text normalization using BasicTextNormalizer
- Computes WER and CER using jiwer
- Evaluates each dataset independently
- Computes combined metrics across all datasets
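The scoring steps in this pipeline can be illustrated with a dependency-free sketch. It is not the code from inference.py: the normalize function is a simplified stand-in for BasicTextNormalizer, the edit distance is implemented directly instead of calling jiwer, the sample sentences are made up, and pooling errors over all reference units is only one possible way to compute a combined score (the card does not say which is used).

```python
import re


def normalize(text):
    """Simplified stand-in normalizer: lowercase and drop punctuation."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def error_rate(references, hypotheses, unit="word"):
    """Pooled error rate: total edits divided by total reference units."""
    errors = total = 0
    for ref, hyp in zip(references, hypotheses):
        ref, hyp = normalize(ref), normalize(hyp)
        ref_units = ref.split() if unit == "word" else list(ref)
        hyp_units = hyp.split() if unit == "word" else list(hyp)
        errors += edit_distance(ref_units, hyp_units)
        total += len(ref_units)
    return errors / total


# Made-up reference/hypothesis pairs with one wrong word in the second pair.
refs = ["Hoxe vai bo tempo.", "O can corre polo parque."]
hyps = ["hoxe vai bo tempo", "o can corre pola parque"]

wer = error_rate(refs, hyps, unit="word")
cer = error_rate(refs, hyps, unit="char")
print(f"WER={wer:.4f} CER={cer:.4f}")
```

Running the same function once per dataset gives per-dataset scores, and running it over the concatenated references and hypotheses gives a pooled score in which longer utterances carry more weight.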
Evaluation Results
Per-Dataset Results
| Dataset | N | WER | CER |
|---|---|---|---|
| CommonVoice | 14563 | 0.0290 | 0.0073 |
| FLEURS | 212 | 0.0771 | 0.0419 |
| Transcrispeech | 1710 | 0.1168 | 0.0487 |
| FalAI | 47760 | 0.0024 | 0.0012 |
| OpenSLR | 282 | 0.0493 | 0.0193 |
| Combined | 64527 | 0.0213 | 0.0081 |
Interpretation
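- Strongest results on FalAI (WER 0.0024) and Common Voice (WER 0.0290), which cover read speech
- Solid robustness on broadcast and interview domains (Transcrispeech, WER 0.1168), the hardest test set
- Low CER throughout (0.0081 combined, at most 0.0487 per dataset), indicating clean and consistent character-level output
- Good generalization across speaking styles, with WER below 0.12 on every dataset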
Infrastructure
Training and evaluation were executed inside a Docker container with mounted datasets and model directories.
Docker Volume Configuration
- ./inference.py:/workspace/inference.py:ro
- /home/devbcp/Practicas/phi-4-mm-instruct-v1.0:/workspace/model:ro
- /home/devbcp/Proyectos/00-DATASETS/ASR:/mnt/datasets:ro
- ./outputs:/workspace/outputs
Contact information
For further information, send an email to proxecto.nos@usc.gal
Licensing information
Acknowledgements
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project Desarrollo de Modelos ALIA. (This publication of the project Desarrollo de Modelos ALIA is funded by the Ministerio para la Transformación Digital y de la Función Pública and by the Plan de Recuperación, Transformación y Resiliencia – Funded by the European Union – NextGenerationEU).
Thanks also to Balidea for the technical development of this model.
Citation
@misc{proxectenos2026phi-4-multimodal-instruct-gl-v1.0,
author = {{Proxecto Nós}},
title = {{Phi-4 Multimodal ASR} (Galician Fine-Tuned)},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/proxectonos/phi-4-multimodal-instruct-gl-v1.0/}},
}
