URL: https://huggingface.co/mohammed/fastconformer-quran-ar


FastConformer Quran Arabic ASR

A fine-tuned NVIDIA FastConformer Hybrid Large model for Quranic Arabic speech recognition, achieving 0.14% Word Error Rate on the tarteel-ai/everyayah validation set.

This model supports both offline transcription (full bidirectional context, highest accuracy) and real-time streaming (causal local attention, cache-aware frame-by-frame inference).


Model Details

| Property | Value |
|---|---|
| Base model | nvidia/stt_ar_fastconformer_hybrid_large_pcd_v1.0 |
| Architecture | EncDecHybridRNNTCTCBPE (FastConformer-Large) |
| Parameters | 114.6M |
| Encoder layers | 18 × FastConformer blocks |
| Tokenizer | SentencePiece BPE, 1024 tokens |
| Sample rate | 16 kHz, mono |
| Val WER (offline) | 0.0014 (0.14%) |
| Dataset | tarteel-ai/everyayah |
| Framework | NVIDIA NeMo |
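
For readers unfamiliar with the metric, word error rate is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The sketch below is a minimal, dependency-free illustration of that definition. The function name `word_error_rate` is hypothetical, and this is not necessarily the scoring code behind the figure in the table, since NeMo ships its own WER metric.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words
print(word_error_rate("a b c d", "a x c d"))  # 0.25
```

On this scale, a WER of 0.0014 corresponds to roughly 14 word errors per 10,000 reference words.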

Training

Fine-tuned using a 3-phase progressive unfreezing strategy on a single NVIDIA RTX 4070 Ti (12 GB):

| Phase | Layers unfrozen | Steps | LR | Val WER |
|---|---|---|---|---|
| Phase 1 | Top 3 encoder + decoder | 2000 | 5e-5 | 0.0038 |
| Phase 2 | Upper half (layers 9–17) + decoder | 3000 | 1e-4 | 0.0018 |
| Phase 3 | All layers | 2500 | 5e-5 | 0.0014 |

Progressive unfreezing prevents catastrophic forgetting of the base model's Arabic speech representations while allowing the full model to adapt to Quranic phonetics, tajweed rules, and recitation style.
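
The schedule above can be sketched as a small helper that maps each phase to the encoder layers left trainable. This is an illustrative sketch, not the training script used for this model: `unfrozen_layers` and `apply_phase` are hypothetical names, layer indices are assumed to be 0-based, and the decoder is assumed to stay trainable in every phase, as the table indicates.

```python
def unfrozen_layers(phase: int, n_layers: int = 18) -> list[int]:
    """Encoder layer indices (0-based) that are trainable in a given phase."""
    if phase == 1:
        return list(range(n_layers - 3, n_layers))   # top 3 layers
    if phase == 2:
        return list(range(n_layers // 2, n_layers))  # upper half (9-17 for 18 layers)
    if phase == 3:
        return list(range(n_layers))                 # all layers
    raise ValueError(f"unknown phase: {phase}")


def apply_phase(layers, phase: int) -> None:
    """Set requires_grad on every parameter of every layer for this phase.

    `layers` is any sequence of objects exposing `.parameters()`, such as
    torch.nn.Module instances (assumption: the encoder's layer list).
    """
    trainable = set(unfrozen_layers(phase, n_layers=len(layers)))
    for idx, layer in enumerate(layers):
        for param in layer.parameters():
            param.requires_grad = idx in trainable


print(unfrozen_layers(1))  # [15, 16, 17]
print(unfrozen_layers(2))  # [9, 10, 11, 12, 13, 14, 15, 16, 17]
```

With a NeMo model, `apply_phase(model.encoder.layers, phase)` would be the natural call site, followed by rebuilding the optimizer so that it only tracks trainable parameters.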

Training data: tarteel-ai/everyayah — a diverse multi-reciter dataset of complete Quranic recitations at multiple audio qualities, covering all 114 surahs across dozens of reciters.


Usage

Installation

pip install "nemo_toolkit[asr]"

Offline transcription (recommended for files)

The .nemo file is saved with full bidirectional attention context, so transcribe() works out of the box with no extra configuration.

import nemo.collections.asr as nemo_asr

model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(
    "mohammed/fastconformer-quran-ar"
)
model.eval()

# Transcribe a .wav file (16kHz mono)
result = model.transcribe(["recitation.wav"])
print(result[0].text)
# e.g. "بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ"

Real-time streaming

The model supports cache-aware streaming inference via the encoder's cache_aware_stream_step() method in NeMo. The loading steps below must run in the order shown:

import torch
import nemo.collections.asr as nemo_asr
from omegaconf import OmegaConf, open_dict

model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(
    "mohammed/fastconformer-quran-ar"
)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Step 1: reset conv padding to symmetric (safety check before mode switch)
for layer in model.encoder.layers:
    if hasattr(layer, "conv") and hasattr(layer.conv, "conv"):
        conv = layer.conv.conv
        ks = conv.kernel_size[0] if isinstance(conv.kernel_size, tuple) else conv.kernel_size
        conv.padding = ((ks - 1) // 2, (ks - 1) // 2)

# Step 2: switch to causal local attention for streaming
model.change_attention_model(
    self_attention_model="rel_pos_local_attn",
    att_context_size=[128, 0],  # 128 frames lookback (~10s), fully causal
)
with open_dict(model.cfg):
    model.cfg.encoder.conv_context_size = "causal"

# Step 3: causal conv padding
for layer in model.encoder.layers:
    if hasattr(layer, "conv") and hasattr(layer.conv, "conv"):
        conv = layer.conv.conv
        ks = conv.kernel_size[0] if isinstance(conv.kernel_size, tuple) else conv.kernel_size
        conv.padding = (ks - 1, 0)

# Step 4 — greedy decoder
decoding_cfg = OmegaConf.structured(model.cfg.decoding)
OmegaConf.set_struct(decoding_cfg, False)
decoding_cfg.strategy = "greedy"
decoding_cfg.greedy.max_symbols = 10
decoding_cfg.greedy.use_cuda_graph_decoder = False # incompatible with streaming
model.change_decoding_strategy(decoding_cfg)
model.eval()

# Streaming step: process one 1600 ms chunk of float32 audio in [-1, 1]
# (PCM int16 input must be converted first, e.g. by dividing by 32768)
cache_last_channel, cache_last_time, cache_last_channel_len = (
    model.encoder.get_initial_cache_state(batch_size=1)
)
chunk_samples = 16000 * 1600 // 1000  # 1600 ms chunk = 25600 samples

audio_chunk = torch.zeros(1, chunk_samples, device=device)  # replace with real audio
audio_len = torch.tensor([chunk_samples], device=device)

with torch.no_grad():
    processed, processed_len = model.preprocessor(
        input_signal=audio_chunk, length=audio_len
    )
    # The three returned cache tensors must be passed into the next call
    encoded, encoded_len, cache_last_channel, cache_last_time, cache_last_channel_len = (
        model.encoder.cache_aware_stream_step(
            processed_signal=processed,
            processed_signal_length=processed_len,
            cache_last_channel=cache_last_channel,
            cache_last_time=cache_last_time,
            cache_last_channel_len=cache_last_channel_len,
            keep_all_outputs=False,
        )
    )

For a complete streaming implementation with microphone input, silence detection, word callbacks, and a FastAPI WebSocket server, see the companion script in the repository files.
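
The numbers in the streaming snippet above can be checked with a little arithmetic. The sketch below assumes a 10 ms feature hop and an 8× encoder subsampling factor, which are typical for FastConformer but are not stated in this card, so the derived 80 ms per encoder frame should be treated as an assumption. `iter_chunks` is a hypothetical helper for slicing a sample buffer into fixed-size chunks.

```python
SAMPLE_RATE = 16_000    # Hz, from the model card
CHUNK_MS = 1_600        # chunk length used in the streaming snippet
HOP_MS = 10             # assumed feature hop
SUBSAMPLING = 8         # assumed encoder subsampling factor
LOOKBACK_FRAMES = 128   # left context from att_context_size=[128, 0]

chunk_samples = SAMPLE_RATE * CHUNK_MS // 1000
frame_ms = HOP_MS * SUBSAMPLING
frames_per_chunk = CHUNK_MS // frame_ms
lookback_s = LOOKBACK_FRAMES * frame_ms / 1000

print(chunk_samples, frame_ms, frames_per_chunk, lookback_s)  # 25600 80 20 10.24


def iter_chunks(samples, size=chunk_samples):
    """Yield consecutive fixed-size chunks, zero-padding the final one."""
    for start in range(0, len(samples), size):
        chunk = list(samples[start:start + size])
        chunk.extend([0.0] * (size - len(chunk)))
        yield chunk
```

Under these assumptions each 1600 ms chunk yields 20 encoder frames, and the 128-frame lookback covers about 10 s of audio, which matches the comment in the snippet.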


Qualitative Examples

The following are exact reference vs. predicted outputs from the validation set — the model transcribed these word-for-word correctly, including full diacritisation (tashkeel):

| Reference | Predicted |
|---|---|
| وَهُوَ الَّذِي جَعَلَ لَكُمُ اللَّيْلَ لِبَاسًا وَالنَّوْمَ سُبَاتًا وَجَعَلَ النَّهَارَ نُشُورًا | ✅ Perfect |
| الزَّانِي لَا يَنْكِحُ إِلَّا زَانِيَةً أَوْ مُشْرِكَةً وَالزَّانِيَةُ لَا يَنْكِحُهَا إِلَّا زَانٍ أَوْ مُشْرِكٌ وَحُرِّمَ ذَلِكَ عَلَى الْمُؤْمِنِينَ | ✅ Perfect |
| إِلَّا مَنْ تَابَ وَآمَنَ وَعَمِلَ عَمَلًا صَالِحًا فَأُولَئِكَ يُبَدِّلُ اللَّهُ سَيِّئَاتِهِمْ حَسَنَاتٍ وَكَانَ اللَّهُ غَفُورًا رَحِيمًا | ✅ Perfect |
| إِذْ قَالَ لِأَبِيهِ وَقَوْمِهِ مَا تَعْبُدُونَ | ✅ Perfect |
| يَوْمَ لَا يَنْفَعُ مَالٌ وَلَا بَنُونَ | ✅ Perfect |
| إِذْ قَالَ لَهُمْ أَخُوهُمْ هُودٌ أَلَا تَتَّقُونَ | ✅ Perfect |
| أَتَبْنُونَ بِكُلِّ رِيعٍ آيَةً تَعْبَثُونَ | ✅ Perfect |
| فَنَجَّيْنَاهُ وَأَهْلَهُ أَجْمَعِينَ | ✅ Perfect |
| فَقَرَأَهُ عَلَيْهِمْ مَا كَانُوا بِهِ مُؤْمِنِينَ | ✅ Perfect |
| وَأَنْذِرْ عَشِيرَتَكَ الْأَقْرَبِينَ | ✅ Perfect |
| الَّذِينَ يُقِيمُونَ الصَّلَاةَ وَيُؤْتُونَ الزَّكَاةَ وَهُمْ بِالْآخِرَةِ هُمْ يُوقِنُونَ | ✅ Perfect |

These span multiple surahs (Al-Furqan, An-Nur, Ash-Shu'ara, As-Saffat) and include some of the most phonetically demanding ayahs in the Quran — long compound sentences, rare vocabulary (نُشُورًا، سُبَاتًا), emphatic consonants, and precise tashkeel on every word.
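
Because the outputs carry full tashkeel, a single wrong vowel mark makes a word count as an error when strings are compared exactly. To separate diacritic mistakes from base-letter mistakes, one option is to compare the text a second time with combining marks removed. The sketch below does this with the standard library; `strip_tashkeel` and `same_skeleton` are hypothetical helper names, not part of NeMo or this repository.

```python
import unicodedata


def strip_tashkeel(text: str) -> str:
    """Remove Unicode nonspacing marks (category Mn), which include the Arabic harakat."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def same_skeleton(reference: str, hypothesis: str) -> bool:
    """True when two strings agree once diacritics are ignored."""
    return strip_tashkeel(reference) == strip_tashkeel(hypothesis)


# Same three letters, but the final vowel is kasra in one and damma in the other
ref = "\u0628\u0650\u0633\u0652\u0645\u0650"
hyp = "\u0628\u0650\u0633\u0652\u0645\u064f"
print(ref == hyp, same_skeleton(ref, hyp))  # False True
```

Note that filtering on category Mn also removes the dagger alif (U+0670) and the Quranic annotation signs, which some applications may prefer to keep.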


Intended Use & Limitations

Intended use:

  • Quranic recitation transcription and verification
  • Tajweed learning applications
  • Ayah identification from audio
  • Recitation correction apps (compare hypothesis against reference ayah)
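
For the recitation-correction use case, comparing a hypothesis against the reference ayah can be sketched with the standard library's `difflib`. This is a minimal illustration under the assumption that both strings are whitespace-tokenised the same way; `word_diff` is a hypothetical helper, and a production app would likely also need alignment across ayah boundaries.

```python
from difflib import SequenceMatcher


def word_diff(reference: str, hypothesis: str):
    """Return (op, reference_words, hypothesis_words) for each mismatching span."""
    ref, hyp = reference.split(), hypothesis.split()
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    return [
        (op, ref[i1:i2], hyp[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


# A dropped word shows up as a "delete" against the reference
print(word_diff("one two three four", "one two four"))
# [('delete', ['three'], [])]
```

An empty list means the recitation matched the reference word for word.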

Limitations:

  • Optimised specifically for Quranic Arabic; performance on Modern Standard Arabic or dialectal Arabic will be significantly lower than that of the base model
  • Best results on clean, single-speaker recitation audio at 16kHz
  • The streaming mode introduces ~1.6s of latency per chunk due to the encoder's minimum chunk size requirement
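
Since the model expects 16 kHz mono input, it can help to validate files before transcription. The sketch below uses only the standard `wave` module and therefore handles uncompressed PCM WAV files only; `check_wav` is a hypothetical helper name.

```python
import wave


def check_wav(path: str, expected_rate: int = 16_000) -> list[str]:
    """Return a list of problems found; an empty list means the file looks fine."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != expected_rate:
            problems.append(
                f"sample rate is {wav.getframerate()} Hz, expected {expected_rate}"
            )
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels, expected mono")
    return problems
```

Files that fail the check would need resampling or downmixing with an audio tool before being passed to transcribe().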

Citation

If you use this model, please cite the model and the dataset:

@misc{fastconformer-quran-ar,
 author = {Mohammed},
 title = {FastConformer Quran Arabic ASR},
 year = {2026},
 publisher = {Hugging Face},
 url = {https://huggingface.co/mohammed/fastconformer-quran-ar}
}
@misc{everyayah,
 author = {Tarteel AI},
 title = {EveryAyah: A Quranic Recitation Dataset},
 publisher = {Hugging Face},
 url = {https://huggingface.co/datasets/tarteel-ai/everyayah}
}