Phi-4 Multimodal ASR (Galician Fine-Tuned)

This model is a fine-tuned version of microsoft/Phi-4-multimodal-instruct adapted for automatic speech recognition (ASR) in Galician (gl).

It uses Phi-4’s multimodal capabilities to take audio embeddings as input and generate the transcription autoregressively as text, offering an alternative to traditional encoder-decoder speech models.

The system is designed to deliver robust transcription quality across multiple Galician speech domains, including read speech, interviews, conversational speech, and broadcast media.


Model Architecture

The base model is Phi-4 Multimodal, a decoder-only multimodal transformer capable of handling text, audio, and image inputs.

For ASR:

  • Audio is converted into learned audio embeddings
  • These embeddings are inserted into the prompt at the position of a special audio placeholder token (<|audio_1|>)
  • The model generates the transcription autoregressively as text

Only the audio embedding layers are fine-tuned, while the rest of the model remains frozen.
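
The freezing scheme can be sketched without any deep learning framework. The helper below is a hypothetical illustration, not the code in finetune_phi4_gl_asr.py. It assumes that the trainable layers can be recognised by the substring audio_embed in their parameter names and that each parameter exposes a requires_grad flag, as PyTorch parameters do. The parameter names in the example are invented.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Param:
    """Stand-in for a framework parameter; only the gradient flag matters here."""
    requires_grad: bool = True


def freeze_except(named_params: Iterable[Tuple[str, Param]],
                  keyword: str = "audio_embed") -> List[str]:
    """Enable gradients only for parameters whose name contains `keyword`.

    Returns the names left trainable. With PyTorch, `named_params` would be
    `model.named_parameters()` (assumption: relevant names contain `audio_embed`).
    """
    trainable = []
    for name, param in named_params:
        param.requires_grad = keyword in name
        if param.requires_grad:
            trainable.append(name)
    return trainable


# Hypothetical parameter names, for illustration only.
params = {
    "model.embed_tokens_extend.audio_embed.encoder.weight": Param(),
    "model.layers.0.self_attn.qkv_proj.weight": Param(),
    "lm_head.weight": Param(),
}
trainable_names = freeze_except(params.items())
print(trainable_names)
```

Printing the returned names before training starts is a cheap way to confirm that nothing outside the audio embedding layers receives gradients.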


Training Data

The model was trained and evaluated on a multi-corpus Galician ASR dataset composed of public and curated speech resources.
All datasets were standardized into a homogeneous format and mounted inside the training container.

Datasets Included

  • Common Voice v23 (Galician) – volunteer read speech
  • FLEURS GL-EN – short TTS-style utterances
  • Transcrispeech (Galician) – radio, television and interviews
  • FalAI – mixed corpus including read and conversational speech
  • OpenSLR Galician – controlled read speech

These datasets provide coverage across clean read speech, semi-spontaneous speech, and more challenging real-world conditions.


Training Procedure

Fine-tuning was performed using a custom training pipeline implemented in finetune_phi4_gl_asr.py.

Training Strategy

  • Only parameters related to audio embeddings (audio_embed) were updated
  • All other model weights remained frozen
  • Supervised training using target Galician transcripts
  • A fixed prompt format was used:
    <|audio_1|> Transcribe the audio clip into Galician text.
  • Target sequences were terminated with:
    <|end|><|endoftext|>

  • Periodic evaluation every 20% of total training progress

  • Custom Trainer modifications to prevent out-of-memory (OOM) errors

  • Checkpoints saved every 1000 training steps
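
The prompt format, target termination and schedule described in this list can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual pipeline: the function names, the dictionary layout and the rounding of evaluation points are hypothetical, and only the prompt string, the terminator and the 20% / 1000-step intervals come from the list above.

```python
from typing import Dict, List

PROMPT = "<|audio_1|> Transcribe the audio clip into Galician text."
TARGET_SUFFIX = "<|end|><|endoftext|>"


def build_example(transcript: str) -> Dict[str, str]:
    """Pair the fixed prompt with a transcript terminated by the end tokens."""
    return {"prompt": PROMPT, "target": transcript.strip() + TARGET_SUFFIX}


def evaluation_steps(total_steps: int, parts: int = 5) -> List[int]:
    """Steps at which evaluation runs when it fires every 1/parts of training."""
    return [max(1, round(total_steps * k / parts)) for k in range(1, parts + 1)]


def checkpoint_steps(total_steps: int, every: int = 1000) -> List[int]:
    """Steps at which a checkpoint is written when saving every `every` steps."""
    return list(range(every, total_steps + 1, every))


# Hypothetical transcript and run length, for illustration only.
example = build_example("  Bos dias a todos  ")
print(example["target"])
print(evaluation_steps(2500))
print(checkpoint_steps(2500))
```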

Hyperparameters

  • Epochs: 3
  • Learning rate: 5e-5
  • Batch size per GPU: 1
  • Gradient accumulation: 16
  • Precision: FP16
  • Gradient checkpointing: Enabled
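
With a per-GPU batch size of 1 and 16 accumulation steps, each optimizer update sees 16 examples per GPU. The sketch below works through that arithmetic. The number of GPUs and the number of training examples are not stated in this card, so both appear as hypothetical inputs, and the step count is an approximation that ignores how a particular trainer handles the last partial batch.

```python
import math

EPOCHS = 3
BATCH_PER_GPU = 1
GRAD_ACCUM = 16


def effective_batch_size(num_gpus: int) -> int:
    """Examples contributing to one optimizer update."""
    return BATCH_PER_GPU * GRAD_ACCUM * num_gpus


def optimizer_steps(num_examples: int, num_gpus: int) -> int:
    """Approximate number of optimizer updates over all epochs."""
    per_epoch = math.ceil(num_examples / effective_batch_size(num_gpus))
    return per_epoch * EPOCHS


# Hypothetical setup: one GPU, 16000 training examples.
print(effective_batch_size(1))
print(optimizer_steps(16000, 1))
```
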

Evaluation

Evaluation was conducted using the script inference.py.

The evaluation pipeline:

  1. Loads the fine-tuned model from /workspace/model
  2. Loads datasets from /mnt/datasets
  3. Generates transcriptions using model.generate()
  4. Applies text normalization using BasicTextNormalizer
  5. Computes WER and CER using jiwer
  6. Evaluates each dataset independently
  7. Computes combined metrics across all datasets
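
Steps 4 to 7 can be illustrated with a dependency-free sketch. The actual script uses BasicTextNormalizer and jiwer; the code below swaps in a simplified normalizer (lowercasing and punctuation removal) and a plain Levenshtein distance only to show what is being measured. It also assumes that the combined metrics pool errors over all utterances rather than averaging per-dataset scores, which the card does not state. The dataset contents are invented.

```python
import re
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[str, str]  # (reference, hypothesis)


def normalize(text: str) -> str:
    """Simplified stand-in for BasicTextNormalizer: lowercase, drop punctuation."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance counting substitutions, insertions and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def error_rates(pairs: List[Pair]) -> Tuple[float, float]:
    """Pooled (WER, CER): total edits divided by total reference length."""
    w_err = w_tot = c_err = c_tot = 0
    for ref, hyp in pairs:
        ref, hyp = normalize(ref), normalize(hyp)
        w_err += edit_distance(ref.split(), hyp.split())
        w_tot += len(ref.split())
        c_err += edit_distance(ref, hyp)
        c_tot += len(ref)
    return w_err / w_tot, c_err / c_tot


# Hypothetical data: each dataset is scored alone, then all pairs are pooled.
datasets: Dict[str, List[Pair]] = {
    "set_a": [("A casa grande.", "a casa grand")],
    "set_b": [("Bos dias, amigos!", "bos dias amigos")],
}
per_dataset = {name: error_rates(pairs) for name, pairs in datasets.items()}
combined = error_rates([p for pairs in datasets.values() for p in pairs])
print(per_dataset, combined)
```

Pooling weights each dataset by its amount of reference text, so large or long-utterance corpora dominate the combined score.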

Evaluation Results

Per-Dataset Results

Dataset          N       WER      CER
CommonVoice      14563   0.0290   0.0073
FLEURS           212     0.0771   0.0419
Transcrispeech   1710    0.1168   0.0487
FalAI            47760   0.0024   0.0012
OpenSLR          282     0.0493   0.0193
Combined         64527   0.0213   0.0081
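
A short consistency check on the table, using only the reported figures: the per-dataset counts sum to the combined N, and FalAI accounts for roughly 74% of the evaluated items. The N-weighted mean of the per-dataset WERs comes out near 0.0119, not the reported 0.0213. That gap would be consistent with the combined figure being pooled over words rather than averaged over items, though the card does not say how it was computed.

```python
# (N, WER, CER) per dataset, copied from the results table.
RESULTS = {
    "CommonVoice": (14563, 0.0290, 0.0073),
    "FLEURS": (212, 0.0771, 0.0419),
    "Transcrispeech": (1710, 0.1168, 0.0487),
    "FalAI": (47760, 0.0024, 0.0012),
    "OpenSLR": (282, 0.0493, 0.0193),
}
COMBINED = (64527, 0.0213, 0.0081)

# Total number of evaluated items and each dataset's share of it.
total_n = sum(n for n, _, _ in RESULTS.values())
shares = {name: n / total_n for name, (n, _, _) in RESULTS.items()}

# Mean of per-dataset WERs weighted by N (not the same as a pooled WER).
n_weighted_wer = sum(n * wer for n, wer, _ in RESULTS.values()) / total_n

print(total_n == COMBINED[0])
print(round(shares["FalAI"], 3))
print(round(n_weighted_wer, 4))
```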

Interpretation

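  • Lowest error rates on FalAI (WER 0.0024) and Common Voice (WER 0.0290), both of which contain read speech
  • Higher error on broadcast and interview speech (Transcrispeech, WER 0.1168), the most challenging domain evaluated
  • Low CER throughout (0.0081 combined, at most 0.0487 on any dataset), indicating clean and consistent character-level output
  • WER stays below 0.12 on every dataset, across read, conversational and broadcast speaking styles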

Infrastructure

Training and evaluation were executed inside a Docker container with mounted datasets and model directories.

Docker Volume Configuration

- ./inference.py:/workspace/inference.py:ro
- /home/devbcp/Practicas/phi-4-mm-instruct-v1.0:/workspace/model:ro
- /home/devbcp/Proyectos/00-DATASETS/ASR:/mnt/datasets:ro
- ./outputs:/workspace/outputs

Contact information

For further information, send an email to proxecto.nos@usc.gal

Licensing information

Apache License, Version 2.0

Acknowledgements

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública and by the European Union (NextGenerationEU), within the framework of the project Desarrollo de Modelos ALIA. (This publication of the project Desarrollo de Modelos ALIA is funded by the Ministerio para la Transformación Digital y de la Función Pública and by the Plan de Recuperación, Transformación y Resiliencia – Funded by the European Union – NextGenerationEU.)

Thanks also to Balidea for the technical development of this model.

Citation

@misc{proxectenos2026phi-4-multimodal-instruct-gl-v1.0,
 author = {{Proxecto Nós}},
 title = {{Phi-4 Multimodal ASR} (Galician Fine-Tuned)},
 year = {2026},
 publisher = {Hugging Face},
 howpublished = {\url{https://huggingface.co/proxectonos/phi-4-multimodal-instruct-gl-v1.0/}},
}