Phi-4 Multimodal ASR (Galician Fine-Tuned)

This model is a fine-tuned version of microsoft/Phi-4-multimodal-instruct adapted for automatic speech recognition (ASR) in Galician (gl).

It uses Phi-4's multimodal input path: audio is encoded into embeddings that the decoder-only language model consumes alongside a text prompt, and the transcription is generated autoregressively as text. This is a different design from traditional encoder-decoder speech models.

The system is designed to deliver robust transcription quality across multiple Galician speech domains, including read speech, interviews, conversational speech, and broadcast media.

Model Architecture

The base model is Phi-4 Multimodal, a decoder-only multimodal transformer capable of handling text, audio, and image inputs.

For ASR:

Audio is converted into learned audio embeddings
These embeddings are inserted into the prompt at the position of a special audio placeholder token (<|audio_1|>)
The model generates the transcription autoregressively as text
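
The three steps above can be sketched as a single transcription call. This is a minimal sketch, not the project's inference.py: it follows the usage documented for the base model (a processor that accepts `text=` and `audios=`), and it assumes the prompt is wrapped in the base model's `<|user|>…<|end|><|assistant|>` chat tags, which this card does not state. The `soundfile` dependency, the `model_dir` argument and the `max_new_tokens` default are likewise assumptions.

```python
PROMPT_BODY = "<|audio_1|> Transcribe the audio clip into Galician text."


def build_prompt(body: str = PROMPT_BODY) -> str:
    # Assumption: chat tags as documented for the base model; the card
    # itself only specifies the text between them.
    return f"<|user|>{body}<|end|><|assistant|>"


def transcribe(audio_path: str, model_dir: str, max_new_tokens: int = 256) -> str:
    # Heavy imports stay local so build_prompt() works without them installed.
    import soundfile as sf
    from transformers import AutoModelForCausalLM, AutoProcessor

    processor = AutoProcessor.from_pretrained(model_dir, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_dir, trust_remote_code=True, torch_dtype="auto"
    )
    model.eval()

    # The processor turns the waveform into audio features and expands the
    # <|audio_1|> placeholder to make room for the audio embeddings.
    audio, sample_rate = sf.read(audio_path)
    inputs = processor(
        text=build_prompt(), audios=[(audio, sample_rate)], return_tensors="pt"
    ).to(model.device)

    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the generated transcription is decoded.
    new_ids = output_ids[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(new_ids, skip_special_tokens=True)[0].strip()
```

A call would look like `transcribe("clip.wav", "/workspace/model")`, where `clip.wav` is a hypothetical file name.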

Only the audio embedding layers are fine-tuned, while the rest of the model remains frozen.
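
A minimal sketch of this selective freezing, assuming parameters are selected by the substring `audio_embed` in their names. The helper accepts any iterable of `(name, param)` pairs whose params expose `requires_grad`, so it would also work on PyTorch's `model.named_parameters()`. The `FakeParam` class and the parameter names in the demo are hypothetical stand-ins, not names taken from the checkpoint.

```python
from typing import Iterable, List, Tuple


def freeze_all_but(named_params: Iterable[Tuple[str, object]],
                   keyword: str = "audio_embed") -> List[str]:
    """Leave requires_grad=True only on parameters whose name contains `keyword`.

    Returns the names of the parameters that remain trainable.
    """
    trainable = []
    for name, param in named_params:
        param.requires_grad = keyword in name
        if param.requires_grad:
            trainable.append(name)
    return trainable


class FakeParam:
    """Stand-in for a torch.nn.Parameter, used only in this demo."""

    def __init__(self) -> None:
        self.requires_grad = True


# Hypothetical parameter names; the real ones come from the loaded model.
params = {
    "model.embed_tokens_extend.audio_embed.encoder.weight": FakeParam(),
    "model.layers.0.self_attn.qkv_proj.weight": FakeParam(),
    "lm_head.weight": FakeParam(),
}
trainable_names = freeze_all_but(params.items())
print(trainable_names)
```

With a real model the call would be `freeze_all_but(model.named_parameters())`, after which only the returned names receive gradient updates.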

Training Data

The model was trained and evaluated on a multi-corpus Galician ASR dataset composed of public and curated speech resources.
All datasets were standardized into a homogeneous format and mounted inside the training container.

Datasets Included

Common Voice v23 (Galician) – volunteer read speech
FLEURS GL-EN – short TTS-style utterances
Transcrispeech (Galician) – radio, television and interviews
FalAI – mixed corpus including read and conversational speech
OpenSLR Galician – controlled read speech

These datasets provide coverage across clean read speech, semi-spontaneous speech, and more challenging real-world conditions.

Training Procedure

Fine-tuning was performed using a custom training pipeline implemented in finetune_phi4_gl_asr.py.

Training Strategy

Only parameters related to audio embeddings (audio_embed) were updated
All other model weights remained frozen
Supervised training using target Galician transcripts
A fixed prompt format was used:

<|audio_1|> Transcribe the audio clip into Galician text.

Target sequences were terminated with:
<|end|><|endoftext|>
Periodic evaluation every 20% of total training progress
Custom Trainer modifications to prevent out-of-memory (OOM) errors
Checkpoints saved every 1000 training steps
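
The prompt and target formats listed above can be illustrated as follows. This is a sketch under assumptions, not an excerpt from finetune_phi4_gl_asr.py: the helper names are hypothetical, the token ids are made up, and masking the prompt positions with -100 (the default `ignore_index` of PyTorch's cross-entropy loss) is a common supervised fine-tuning convention that this card does not confirm.

```python
from typing import List, Tuple

PROMPT = "<|audio_1|> Transcribe the audio clip into Galician text."
ANSWER_SUFFIX = "<|end|><|endoftext|>"
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def build_training_pair(transcript: str) -> Tuple[str, str]:
    """Return the fixed prompt and the terminated target for one utterance."""
    return PROMPT, transcript.strip() + ANSWER_SUFFIX


def build_labels(prompt_ids: List[int],
                 target_ids: List[int]) -> Tuple[List[int], List[int]]:
    """Concatenate prompt and target ids; supervise only the target part."""
    input_ids = prompt_ids + target_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + target_ids
    return input_ids, labels


prompt, target = build_training_pair("  Bos días a todos.  ")
# Made-up token ids standing in for the tokenized prompt and target.
input_ids, labels = build_labels([11, 12, 13], [21, 22])
print(target)
print(labels)
```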

Hyperparameters

Epochs: 3
Learning rate: 5e-5
Batch size per GPU: 1
Gradient accumulation: 16
Precision: FP16
Gradient checkpointing: Enabled
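
To make the relationship between these values concrete, the sketch below derives the effective batch size, the number of optimizer steps, and the evaluation interval implied by "every 20% of total training progress". The training-set size and GPU count are not given in this card, so the figures used in the demo (100,000 utterances, one GPU) are purely hypothetical, and the step count is an approximation of what a training framework would compute.

```python
import math

PER_GPU_BATCH = 1
GRAD_ACCUM = 16
EPOCHS = 3


def schedule(num_train_examples: int, num_gpus: int, eval_fraction: float = 0.2):
    """Return (effective batch size, total optimizer steps, eval interval)."""
    # One optimizer step consumes this many utterances across all GPUs.
    effective_batch = PER_GPU_BATCH * GRAD_ACCUM * num_gpus
    steps_per_epoch = math.ceil(num_train_examples / effective_batch)
    total_steps = steps_per_epoch * EPOCHS
    # Evaluate after every `eval_fraction` of the total optimizer steps.
    eval_every = max(1, round(total_steps * eval_fraction))
    return effective_batch, total_steps, eval_every


# Hypothetical setup: 100,000 training utterances on a single GPU.
effective_batch, total_steps, eval_every = schedule(100_000, num_gpus=1)
print(effective_batch, total_steps, eval_every)
```

With a per-GPU batch of 1, gradient accumulation is what sets the effective batch: each GPU contributes 16 utterances per optimizer step.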

Evaluation

Evaluation was conducted using the script inference.py.

The evaluation pipeline:

Loads the fine-tuned model from /workspace/model
Loads datasets from /mnt/datasets
Generates transcriptions using model.generate()
Applies text normalization using BasicTextNormalizer
Computes WER and CER using jiwer
Evaluates each dataset independently
Computes combined metrics across all datasets
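
The metric step can be sketched without external dependencies. Note the substitutions: the code below uses a plain Levenshtein distance in place of jiwer and a simple lowercase-and-strip-punctuation function in place of BasicTextNormalizer, so its numbers may differ slightly from those of the real pipeline. The example sentences are invented for illustration.

```python
import re
import unicodedata
from typing import List, Sequence


def normalize(text: str) -> str:
    """Rough stand-in for BasicTextNormalizer: lowercase, drop punctuation."""
    text = unicodedata.normalize("NFC", text).lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def error_rate(refs: List[str], hyps: List[str], unit: str) -> float:
    """Pooled error rate: total edits divided by total reference units."""
    split = (lambda s: s.split()) if unit == "word" else list
    edits = total = 0
    for ref, hyp in zip(refs, hyps):
        r, h = split(normalize(ref)), split(normalize(hyp))
        edits += edit_distance(r, h)
        total += len(r)
    return edits / total


refs = ["Bos días, Galicia!", "O tempo está moi bo hoxe."]
hyps = ["bos días galicia", "o tempo esta moi bo"]
wer = error_rate(refs, hyps, unit="word")
cer = error_rate(refs, hyps, unit="char")
print(f"WER={wer:.4f} CER={cer:.4f}")
```

Running the same function once per dataset gives per-dataset scores, and running it on the concatenated reference and hypothesis lists gives a pooled score.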

Evaluation Results

Per-Dataset Results

Dataset	N	WER	CER
CommonVoice	14563	0.0290	0.0073
FLEURS	212	0.0771	0.0419
Transcrispeech	1710	0.1168	0.0487
FalAI	47760	0.0024	0.0012
OpenSLR	282	0.0493	0.0193
Combined	64527	0.0213	0.0081
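
As a quick consistency check on the table, the sketch below sums the per-dataset sample counts and computes two simple averages of the per-dataset WER. The sample counts add up to the Combined row's N. Neither the sample-weighted mean (about 0.012) nor the unweighted mean (about 0.055) reproduces the reported combined WER of 0.0213, which would be consistent with pooling errors over all reference words rather than averaging per-dataset scores. That reading is an assumption: the card does not say how the combined figure is computed.

```python
# (N, WER, CER) per dataset, copied from the table above.
results = {
    "CommonVoice": (14563, 0.0290, 0.0073),
    "FLEURS": (212, 0.0771, 0.0419),
    "Transcrispeech": (1710, 0.1168, 0.0487),
    "FalAI": (47760, 0.0024, 0.0012),
    "OpenSLR": (282, 0.0493, 0.0193),
}

total_n = sum(n for n, _, _ in results.values())

# Mean WER weighted by the number of samples in each dataset.
sample_weighted_wer = sum(n * wer for n, wer, _ in results.values()) / total_n

# Unweighted (macro) mean: every dataset counts equally.
macro_wer = sum(wer for _, wer, _ in results.values()) / len(results)

print(total_n, round(sample_weighted_wer, 4), round(macro_wer, 4))
```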

Interpretation

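Excellent performance on read speech: WER 0.0024 on FalAI and 0.0290 on Common Voice
Solid robustness on broadcast and interview domains (Transcrispeech, WER 0.1168), the highest error rate among the five datasets
Very low CER (0.0081 combined), indicating clean and consistent character-level output
Good generalization across speaking styles: WER stays below 0.12 on every dataset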

Infrastructure

Training and evaluation were executed inside a Docker container with mounted datasets and model directories.

Docker Volume Configuration

- ./inference.py:/workspace/inference.py:ro
- /home/devbcp/Practicas/phi-4-mm-instruct-v1.0:/workspace/model:ro
- /home/devbcp/Proyectos/00-DATASETS/ASR:/mnt/datasets:ro
- ./outputs:/workspace/outputs

Contact information

For further information, send an email to proxecto.nos@usc.gal

Licensing information

Apache License, Version 2.0

Acknowledgements

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project Desarrollo de Modelos ALIA. (This publication of the Desarrollo de Modelos ALIA project is funded by the Ministerio para la Transformación Digital y de la Función Pública and by the Plan de Recuperación, Transformación y Resiliencia – Funded by the European Union – NextGenerationEU.)

Thanks also to Balidea for the technical development of this model.

Citation

@misc{proxectenos2026phi-4-multimodal-instruct-gl-v1.0,
 author = {{Proxecto Nós}},
 title = {{Phi-4 Multimodal ASR (Galician Fine-Tuned)}},
 year = {2026},
 publisher = {Hugging Face},
 howpublished = {\url{https://huggingface.co/proxectonos/phi-4-multimodal-instruct-gl-v1.0/}},
}

Model size: 6B params
Tensor type: F32
Format: Safetensors

URL: https://huggingface.co/proxectonos/phi-4-multimodal-instruct-gl-v1.0
