

Wav2Vec2-Large-XLSR for Hindi IPA Phoneme Recognition

This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 for Hindi speech-to-IPA phoneme recognition with CTC decoding.

Instead of transcribing speech to orthographic text, the model outputs a sequence of individual IPA (International Phonetic Alphabet) phoneme tokens, which makes it useful for phonetic analysis, pronunciation assessment, forced alignment, and linguistic research on Hindi speech.

Model Details

  • Architecture: Wav2Vec2ForCTC (24 transformer layers, 1024 hidden size, 16 attention heads)
  • Base model: facebook/wav2vec2-large-xlsr-53
  • Fine-tuning dataset: AI4Bharat IndicVoices (Hindi subset)
  • Vocabulary: 64 IPA phoneme tokens (including special tokens)
  • Sampling rate: 16 kHz
  • Framework: PyTorch / HuggingFace Transformers
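
The sampling rate and the convolutional feature encoder together determine how many output frames (and therefore how many CTC predictions) the model produces for a clip. The sketch below estimates that count. The strides are the ones quoted in the timestamp example of this card; the kernel sizes are an assumption taken from the standard wav2vec 2.0 configuration and should be checked against `model.config.conv_kernel`. The function name is illustrative.

```python
# Assumed feature-encoder layout (standard wav2vec 2.0 values; verify
# against model.config.conv_kernel and model.config.conv_stride).
CONV_KERNEL = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDE = (5, 2, 2, 2, 2, 2, 2)


def num_output_frames(num_samples: int) -> int:
    """Estimate how many frames the encoder emits for `num_samples` samples."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNEL, CONV_STRIDE):
        # Output length of an unpadded 1-D convolution
        length = (length - kernel) // stride + 1
    return length


# One second of 16 kHz audio
print(num_output_frames(16000))
```

Under these assumptions, one second of audio yields slightly fewer than 50 frames, because the first frame needs a full receptive field of input before it can be emitted.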

Phoneme Vocabulary

The model recognizes 61 Hindi IPA phonemes plus 3 special tokens:

Category Phonemes
Vowels ə, ɑː, i, , u, , , , aːi, aːu
Plosives p, , b, , , t̪ʰ, , d̪ʰ, ʈ, ʈʰ, ɖ, ɖʰ, k, , g, , q
Affricates c, , ɟ, ɟʰ, ɕc
Fricatives s, z, ɕ, ʂ, h, ɦ, f, x, ɣ
Nasals m, n, ɲ, ɳ, ŋ,
Liquids & Glides l, r, ɾ, ɽ, ɽʱ, j, v
Clusters , t̪ɾ, gj
Syllabic , l̩ː, ɹ̩, ɹ̩ː
Special <pad> (CTC blank), <unk>, | (word delimiter)

Usage

Quick Start

from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
import torchaudio

# Load model and processor
model_name = "xnpx/wav2vec2-large-xlsr-ipa-phonemes"
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)
model.eval()

# Load audio; the model expects 16 kHz mono input
waveform, sample_rate = torchaudio.load("audio.wav")  # shape: (channels, samples)
if sample_rate != 16000:
    waveform = torchaudio.transforms.Resample(sample_rate, 16000)(waveform)
waveform = waveform.mean(dim=0)  # downmix to mono; leaves single-channel audio unchanged

# Run inference
inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decode
pred_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(pred_ids)[0]
print(transcription)
# Example output: "n ə m ə s t̪ eː"
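
One way to use the decoded string, for instance in the pronunciation-assessment setting mentioned above, is to compare it with a reference phoneme sequence. The sketch below computes a phoneme error rate: the edit distance over space-separated phoneme tokens divided by the reference length. It assumes the decoded output is space-separated, as in the example output above. The function name `phoneme_error_rate` and the hypothesis string are illustrative and not part of this model's API.

```python
def phoneme_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over phoneme tokens, normalized by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] = edit distance between the ref tokens seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1      # reference token missing from hypothesis
            insertion = curr[j - 1] + 1  # extra token in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / max(len(ref), 1)


reference = "n ə m ə s t̪ eː"
hypothesis = "n ə m s t̪ eː"  # made-up output with one phoneme dropped
print(f"PER: {phoneme_error_rate(reference, hypothesis):.3f}")
```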

With Timestamps (Greedy CTC)

import numpy as np

# Reuses `logits` and `processor` from the Quick Start snippet
log_probs = torch.nn.functional.log_softmax(logits, dim=-1).cpu().numpy()[0]
pred_ids = np.argmax(log_probs, axis=-1)

# ID -> phoneme mapping from the tokenizer vocabulary
id_to_phoneme = {v: k for k, v in processor.tokenizer.get_vocab().items()}
blank_id = processor.tokenizer.pad_token_id  # <pad> doubles as the CTC blank

# Frame duration: product of conv_stride values / sampling_rate
# For this model: 5*2*2*2*2*2*2 = 320 samples per frame -> 20ms at 16kHz
frame_duration_s = 320 / 16000  # 0.02s per frame

phonemes, timestamps = [], []
prev_id = None
for frame_idx, token_id in enumerate(pred_ids):
    if token_id == blank_id:  # skip CTC blank
        prev_id = None
        continue
    if token_id == prev_id:  # skip CTC repeats
        continue
    prev_id = token_id
    phoneme = id_to_phoneme.get(int(token_id), "<unk>")
    if phoneme not in ("<pad>", "<unk>", "|"):
        t = frame_idx * frame_duration_s  # onset of the first frame of this phoneme
        phonemes.append(phoneme)
        timestamps.append(t)

for p, t in zip(phonemes, timestamps):
    print(f" {t:.3f}s {p}")
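
The loop above records only an onset time per phoneme. A small variation, sketched below in plain Python, also reports where each run of identical frame predictions ends, giving rough (start, end) spans. The function name `ctc_segments` and the toy IDs are hypothetical. Because CTC outputs tend to be peaky, these spans are approximate and often shorter than the acoustic phoneme.

```python
SKIP_TOKENS = ("<pad>", "<unk>", "|")


def ctc_segments(frame_ids, id_to_phoneme, blank_id=0, frame_s=0.02):
    """Collapse per-frame argmax IDs into (phoneme, start_s, end_s) tuples."""
    segments = []
    run_id, run_start = None, 0
    # The trailing None is a sentinel that flushes the final run
    for idx, tok in enumerate(list(frame_ids) + [None]):
        if tok == run_id:
            continue  # still inside the same run of identical predictions
        if run_id is not None and run_id != blank_id:
            phoneme = id_to_phoneme.get(int(run_id), "<unk>")
            if phoneme not in SKIP_TOKENS:
                # Rounded to the millisecond to avoid float noise
                segments.append(
                    (phoneme, round(run_start * frame_s, 3), round(idx * frame_s, 3))
                )
        run_id, run_start = tok, idx
    return segments


# Toy example with made-up IDs: 0 = blank, 3 = "n", 4 = "ə", 5 = "|"
toy_vocab = {0: "<pad>", 3: "n", 4: "ə", 5: "|"}
toy_ids = [0, 0, 3, 3, 3, 0, 4, 4, 5, 0]
for phoneme, start, end in ctc_segments(toy_ids, toy_vocab):
    print(f"{start:.2f}-{end:.2f}s {phoneme}")
```

With the real model, `frame_ids` would be `pred_ids` and `blank_id` the tokenizer's pad token ID.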

Training Details

  • Base model: facebook/wav2vec2-large-xlsr-53 (pre-trained on 53 languages)
  • Dataset: AI4Bharat IndicVoices Hindi split
  • Text-to-phoneme conversion: Devanagari script → IPA via rule-based transliteration
  • Loss: CTC (Connectionist Temporal Classification)
  • Optimizer: AdamW
  • Training framework: HuggingFace Trainer
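
To illustrate what rule-based Devanagari-to-IPA conversion can look like, here is a deliberately tiny sketch. It is not the conversion used to train this model: the rule tables cover only a handful of characters, every name in it is hypothetical, and it ignores phenomena such as word-final schwa deletion. It handles just enough to reproduce the example word from the Quick Start output.

```python
# Hypothetical, heavily reduced rule tables (not the model's actual rules)
CONSONANTS = {
    "\u0928": "n",   # na
    "\u092e": "m",   # ma
    "\u0938": "s",   # sa
    "\u0924": "t̪",  # ta (dental)
}
MATRAS = {
    "\u093e": "ɑː",  # vowel sign aa
    "\u093f": "i",   # vowel sign i
    "\u0947": "eː",  # vowel sign e
}
VIRAMA = "\u094d"    # suppresses the inherent vowel
SCHWA = "ə"          # inherent vowel of a bare consonant


def devanagari_to_ipa(word: str) -> list:
    """Convert a Devanagari word to IPA phonemes using the toy rules above."""
    phonemes = []
    i = 0
    while i < len(word):
        ch = word[i]
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch not in CONSONANTS:
            raise ValueError(f"No rule for character {ch!r}")
        phonemes.append(CONSONANTS[ch])
        if nxt in MATRAS:
            phonemes.append(MATRAS[nxt])  # explicit vowel sign replaces the schwa
            i += 1
        elif nxt == VIRAMA:
            i += 1                        # consonant cluster: no vowel
        else:
            phonemes.append(SCHWA)        # bare consonant keeps its inherent schwa
        i += 1
    return phonemes


namaste = "\u0928\u092e\u0938\u094d\u0924\u0947"
print(" ".join(devanagari_to_ipa(namaste)))
```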

Limitations

  • Designed specifically for Hindi speech; may not generalize well to other languages
  • CTC-based, with no language model or beam search (greedy decoding only)
  • Phoneme boundaries from greedy decoding are approximate; use CTC segmentation for more accurate alignment
  • Performance may degrade on noisy or far-field audio
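
Since the card ships only greedy decoding, the following is a minimal sketch of CTC prefix beam search without a language model, written in plain Python over per-frame probabilities. It is an illustration of the general technique, not code released with this model, and the function name is hypothetical. The toy input shows a case where greedy decoding picks the blank in every frame while the summed probability of the one-token sequence is higher.

```python
from collections import defaultdict


def ctc_prefix_beam_search(probs, beam_width=4, blank=0):
    """Return (best token sequence, its total probability).

    probs: list of frames, each a list of per-token probabilities.
    """
    # prefix -> (prob. of paths ending in blank, prob. of paths ending in non-blank)
    beams = {(): (1.0, 0.0)}
    for frame in probs:
        next_beams = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_blank, p_non_blank) in beams.items():
            total = p_blank + p_non_blank
            for token, p in enumerate(frame):
                if token == blank:
                    next_beams[prefix][0] += total * p
                elif prefix and token == prefix[-1]:
                    # Repeat directly after the same token: collapses into the prefix
                    next_beams[prefix][1] += p_non_blank * p
                    # Repeat after a blank: a genuinely new token
                    next_beams[prefix + (token,)][1] += p_blank * p
                else:
                    next_beams[prefix + (token,)][1] += total * p
        ranked = sorted(next_beams.items(), key=lambda kv: sum(kv[1]), reverse=True)
        beams = {prefix: tuple(scores) for prefix, scores in ranked[:beam_width]}
    best_prefix, best_scores = max(beams.items(), key=lambda kv: sum(kv[1]))
    return list(best_prefix), sum(best_scores)


# Toy input: token 0 is blank, token 1 is some phoneme
toy_probs = [[0.6, 0.4], [0.6, 0.4]]
sequence, probability = ctc_prefix_beam_search(toy_probs)
print(sequence, round(probability, 2))  # greedy would output the empty sequence (0.36)
```

With the real model, `probs` would come from a softmax over `logits[0]`, and `blank` would be the tokenizer's pad token ID.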

Citation

If you use this model, please cite the underlying wav2vec2-xlsr work:

@inproceedings{conneau2020unsupervised,
 title={Unsupervised Cross-lingual Representation Learning for Speech Recognition},
 author={Conneau, Alexis and Baevski, Alexei and Collobert, Ronan and Mohamed, Abdelrahman and Auli, Michael},
 booktitle={Interspeech},
 year={2020}
}