Wav2Vec2-Large-XLSR for Hindi IPA Phoneme Recognition

This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 that maps Hindi speech to IPA phoneme sequences using CTC decoding.

Instead of transcribing speech to orthographic text, the model outputs individual IPA (International Phonetic Alphabet) phoneme tokens. This makes it useful for phonetic analysis, pronunciation assessment, forced alignment, and linguistic research on Hindi speech.

Model Details

Architecture: Wav2Vec2ForCTC (24 transformer layers, 1024 hidden size, 16 attention heads)
Base model: facebook/wav2vec2-large-xlsr-53
Fine-tuning dataset: AI4Bharat IndicVoices (Hindi subset)
Vocabulary: 64 IPA phoneme tokens (including special tokens)
Sampling rate: 16 kHz
Framework: PyTorch / HuggingFace Transformers

Phoneme Vocabulary

The model recognizes 61 Hindi IPA phonemes plus 3 special tokens:

Category	Phonemes
Vowels	`ə`, `ɑː`, `i`, `iː`, `u`, `uː`, `eː`, `oː`, `aːi`, `aːu`
Plosives	`p`, `pʰ`, `b`, `bʰ`, `t̪`, `t̪ʰ`, `d̪`, `d̪ʰ`, `ʈ`, `ʈʰ`, `ɖ`, `ɖʰ`, `k`, `kʰ`, `g`, `gʰ`, `q`
Affricates	`c`, `cʰ`, `ɟ`, `ɟʰ`, `ɕc`
Fricatives	`s`, `z`, `ɕ`, `ʂ`, `h`, `ɦ`, `f`, `x`, `ɣ`
Nasals	`m`, `n`, `ɲ`, `ɳ`, `ŋ`, `ⁿ`
Liquids & Glides	`l`, `r`, `ɾ`, `ɽ`, `ɽʱ`, `j`, `v`
Clusters	`kʃ`, `t̪ɾ`, `gj`
Syllabic	`l̩`, `l̩ː`, `ɹ̩`, `ɹ̩ː`
Special	`<pad>` (CTC blank), `<unk>`, `|` (word delimiter)
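
The per-category counts in the table can be tallied against the 64-token vocabulary stated under Model Details. This is a minimal sketch: the dictionary is transcribed by hand from the table and is an assumption, not something read from the model's vocabulary file.

```python
# Phoneme counts per category, transcribed by hand from the table above
# (assumption: this mirrors the table, not the model's actual vocab file).
category_counts = {
    "Vowels": 10,
    "Plosives": 17,
    "Affricates": 5,
    "Fricatives": 9,
    "Nasals": 6,
    "Liquids & Glides": 7,
    "Clusters": 3,
    "Syllabic": 4,
}
num_special = 3  # <pad>, <unk>, and the word delimiter |

num_phonemes = sum(category_counts.values())
vocab_size = num_phonemes + num_special
print(num_phonemes, vocab_size)  # 61 64
```

Once the processor is loaded, `len(processor.tokenizer.get_vocab())` should report the same total if the table is complete.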

Usage

Quick Start

from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
import torchaudio

# Load model and processor
model_name = "xnpx/wav2vec2-large-xlsr-ipa-phonemes"
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)
model.eval()

# Load audio; the model expects 16 kHz mono input
waveform, sample_rate = torchaudio.load("audio.wav")
if sample_rate != 16000:
    waveform = torchaudio.transforms.Resample(sample_rate, 16000)(waveform)
waveform = waveform.mean(dim=0)  # average channels to mono, shape (num_samples,)

# Run inference
inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decode
pred_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(pred_ids)[0]
print(transcription)
# Example output: "n ə m ə s t̪ eː"
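
The greedy decode above relies on the CTC collapse rule: merge runs of identical frame-level IDs, then drop the blank. The sketch below shows that rule on its own in plain Python. The function name `ctc_greedy_collapse` and the frame IDs are hypothetical and chosen only for illustration.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse per-frame argmax IDs: merge repeated IDs, then drop blanks."""
    merged = [token_id for token_id, _ in groupby(frame_ids)]
    return [token_id for token_id in merged if token_id != blank_id]


# Hypothetical frame-level IDs; 0 stands for the CTC blank (<pad>).
frames = [0, 7, 7, 0, 0, 3, 3, 3, 0, 7, 0]
collapsed = ctc_greedy_collapse(frames)
print(collapsed)  # [7, 3, 7]
```

A blank between two identical IDs keeps them as separate tokens, which is how CTC represents a repeated phoneme.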

With Timestamps (Greedy CTC)

import numpy as np

log_probs = torch.nn.functional.log_softmax(logits, dim=-1).cpu().numpy()[0]
pred_ids = np.argmax(log_probs, axis=-1)

# Build the ID -> phoneme mapping from the tokenizer vocabulary
id_to_phoneme = {v: k for k, v in processor.tokenizer.get_vocab().items()}

# Frame duration: product of conv_stride values / sampling_rate
# For this model: 5*2*2*2*2*2*2 = 320 samples per frame -> 20ms at 16kHz
frame_duration_s = 320 / 16000 # 0.02s per frame

blank_id = processor.tokenizer.pad_token_id  # <pad> doubles as the CTC blank

phonemes, timestamps = [], []
prev_id = None
for frame_idx, token_id in enumerate(pred_ids):
    if token_id == blank_id:  # skip CTC blank
        prev_id = None
        continue
    if token_id == prev_id:  # skip CTC repeats
        continue
    prev_id = token_id
    phoneme = id_to_phoneme.get(int(token_id), "<unk>")
    if phoneme not in ("<pad>", "<unk>", "|"):
        # Onset time of the first frame in which this phoneme is emitted
        t = frame_idx * frame_duration_s
        phonemes.append(phoneme)
        timestamps.append(t)

for p, t in zip(phonemes, timestamps):
    print(f" {t:.3f}s {p}")
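
The loop above records only onset times. If rough end times are also wanted, each run of identical non-blank frames can be turned into a (token, start, end) segment. The sketch below assumes the stride values quoted in the comment above; `frames_to_segments` and the sample frame IDs are hypothetical. CTC outputs tend to be peaky, so these spans are a coarse estimate rather than true phoneme durations.

```python
import math
from itertools import groupby

CONV_STRIDE = (5, 2, 2, 2, 2, 2, 2)  # stride values quoted in the comment above
SAMPLE_RATE = 16000
FRAME_S = math.prod(CONV_STRIDE) / SAMPLE_RATE  # 320 / 16000 = 0.02 s


def frames_to_segments(frame_ids, blank_id=0):
    """Turn each run of identical non-blank IDs into (token_id, start_s, end_s)."""
    segments = []
    frame_idx = 0
    for token_id, run in groupby(frame_ids):
        run_len = len(list(run))
        if token_id != blank_id:
            start_s = frame_idx * FRAME_S
            end_s = (frame_idx + run_len) * FRAME_S
            segments.append((token_id, start_s, end_s))
        frame_idx += run_len
    return segments


# Hypothetical frame-level IDs; 0 stands for the CTC blank.
segments = frames_to_segments([0, 7, 7, 0, 3, 3, 3, 0])
for token_id, start_s, end_s in segments:
    print(f"{token_id}: {start_s:.2f}s to {end_s:.2f}s")
```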

Training Details

Base model: facebook/wav2vec2-large-xlsr-53 (pre-trained on 53 languages)
Dataset: AI4Bharat IndicVoices (Hindi subset)
Text-to-phoneme conversion: Devanagari script → IPA via rule-based transliteration
Loss: CTC (Connectionist Temporal Classification)
Optimizer: AdamW
Training framework: HuggingFace Trainer
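
The card does not specify the rule-based Devanagari-to-IPA transliteration any further. As a rough illustration only, the sketch below maps a handful of Devanagari characters to IPA and inserts the inherent schwa after a consonant unless a virama or vowel sign follows. The mapping tables and the function name `toy_devanagari_to_ipa` are hypothetical and cover just the characters needed for one word. Real Hindi conversion also needs schwa deletion, nasalization, and conjunct handling, none of which is attempted here.

```python
# Hypothetical, deliberately tiny mapping tables (not the rules used in training).
CONSONANTS = {"न": "n", "म": "m", "स": "s", "त": "t̪"}
MATRAS = {"े": "eː", "ा": "ɑː"}  # dependent vowel signs
VIRAMA = "्"  # suppresses the inherent vowel
SCHWA = "ə"   # inherent vowel of a bare consonant


def toy_devanagari_to_ipa(word):
    """Map a Devanagari word to a list of IPA phonemes using the toy tables."""
    phonemes = []
    chars = list(word)
    for i, ch in enumerate(chars):
        nxt = chars[i + 1] if i + 1 < len(chars) else None
        if ch in CONSONANTS:
            phonemes.append(CONSONANTS[ch])
            # Add the inherent schwa unless a virama or vowel sign follows.
            if nxt != VIRAMA and nxt not in MATRAS:
                phonemes.append(SCHWA)
        elif ch in MATRAS:
            phonemes.append(MATRAS[ch])
        elif ch == VIRAMA:
            continue
        else:
            raise ValueError(f"character not covered by the toy tables: {ch!r}")
    return phonemes


ipa = toy_devanagari_to_ipa("नमस्ते")
print(" ".join(ipa))
```

For this one word the result lines up with the example output in Quick Start. Words that need schwa deletion would come out wrong with these rules.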

Limitations

Designed specifically for Hindi speech; may not generalize well to other languages
CTC output is decoded greedily; no language model or beam search is included
Phoneme boundaries from greedy decoding are approximate; use CTC segmentation for more accurate alignment
Performance may degrade on noisy or far-field audio
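
On the boundary limitation: when the phoneme sequence is already known, a Viterbi pass over the CTC trellis yields a frame-level alignment constrained to that sequence. The sketch below is a plain-Python illustration of that idea. It is not the CTC-segmentation algorithm itself and is not shipped with this model; `ctc_viterbi_align` and the toy probabilities are hypothetical.

```python
import math


def ctc_viterbi_align(log_probs, targets, blank=0):
    """Best CTC path for a known target sequence; returns one label per frame."""
    num_frames = len(log_probs)
    ext = [blank]  # targets interleaved with blanks: b t1 b t2 b ...
    for tok in targets:
        ext += [tok, blank]
    num_states = len(ext)
    neg_inf = -math.inf
    score = [[neg_inf] * num_states for _ in range(num_frames)]
    back = [[0] * num_states for _ in range(num_frames)]
    score[0][0] = log_probs[0][ext[0]]
    if num_states > 1:
        score[0][1] = log_probs[0][ext[1]]
    for t in range(1, num_frames):
        for s in range(num_states):
            cands = [(score[t - 1][s], s)]  # stay in the same state
            if s >= 1:
                cands.append((score[t - 1][s - 1], s - 1))  # advance one state
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append((score[t - 1][s - 2], s - 2))  # skip a blank
            best, prev = max(cands)
            if best > neg_inf:
                score[t][s] = best + log_probs[t][ext[s]]
                back[t][s] = prev
    # A valid path ends in the final blank or the final target token.
    s = num_states - 1
    if num_states > 1 and score[-1][num_states - 2] > score[-1][s]:
        s = num_states - 2
    if score[-1][s] == neg_inf:
        raise ValueError("too few frames for this target sequence")
    path = [blank] * num_frames
    for t in range(num_frames - 1, -1, -1):
        path[t] = ext[s]
        s = back[t][s]
    return path


# Toy per-frame probabilities over 3 tokens (0 = blank); values are made up.
probs = [
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],
    [0.8, 0.1, 0.1],
    [0.1, 0.1, 0.8],
    [0.1, 0.1, 0.8],
]
toy_log_probs = [[math.log(p) for p in row] for row in probs]
path = ctc_viterbi_align(toy_log_probs, targets=[1, 2])
onsets = [t for t, lab in enumerate(path) if lab != 0 and (t == 0 or path[t - 1] != lab)]
print(path, onsets)  # [1, 1, 0, 2, 2] [0, 3]
```

Multiplying each onset frame by the 20 ms frame duration gives start times comparable to those from the greedy timestamp loop.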

Citation

If you use this model, please cite the underlying wav2vec2-xlsr work:

@inproceedings{conneau2020unsupervised,
 title={Unsupervised Cross-lingual Representation Learning for Speech Recognition},
 author={Conneau, Alexis and Baevski, Alexei and Collobert, Ronan and Mohamed, Abdelrahman and Auli, Michael},
 booktitle={Interspeech},
 year={2021}
}


URL: https://huggingface.co/xnpx/wav2vec2-large-xlsr-ipa-phonemes
