FastConformer Quran Arabic ASR

A fine-tuned NVIDIA FastConformer Hybrid Large model for Quranic Arabic speech recognition, achieving 0.14% Word Error Rate on the tarteel-ai/everyayah validation set.

This model supports both offline transcription (full bidirectional attention context, highest accuracy) and real-time streaming (causal local attention with cache-aware, chunk-by-chunk inference).

Model Details

Property	Value
Base model	`nvidia/stt_ar_fastconformer_hybrid_large_pcd_v1.0`
Architecture	EncDecHybridRNNTCTCBPE (FastConformer-Large)
Parameters	114.6M
Encoder layers	18 × FastConformer blocks
Tokenizer	SentencePiece BPE, 1024 tokens
Sample rate	16 kHz, mono
Val WER (offline)	0.0014 (0.14%)
Dataset	tarteel-ai/everyayah
Framework	NVIDIA NeMo
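
For reference, WER is the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below is a minimal illustration of that definition and is not the evaluation code used for this model, which this card does not show. The name `word_error_rate` is hypothetical, and whitespace tokenisation is an assumption.

```python
def word_error_rate(references, hypotheses):
    """Corpus-level WER: total word edits / total reference words."""
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        r, h = ref.split(), hyp.split()
        # Word-level Levenshtein distance, keeping only the previous DP row.
        prev = list(range(len(h) + 1))
        for i, ref_word in enumerate(r, start=1):
            curr = [i]
            for j, hyp_word in enumerate(h, start=1):
                cost = 0 if ref_word == hyp_word else 1
                curr.append(min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # substitution or match
                ))
            prev = curr
        total_edits += prev[-1]
        total_words += len(r)
    return total_edits / total_words if total_words else 0.0
```

Under this definition a WER of 0.0014 corresponds to roughly 14 word errors per 10,000 reference words. Because words are compared as exact strings, a single wrong diacritic counts as a full word error in this sketch.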

Training

Fine-tuned using a 3-phase progressive unfreezing strategy on a single NVIDIA RTX 4070 Ti (12 GB):

Phase	Layers unfrozen	Steps	LR	Val WER
Phase 1	Top 3 encoder layers + decoder	2000	5e-5	0.0038
Phase 2	Upper half (layers 9–17) + decoder	3000	1e-4	0.0018
Phase 3	All layers	2500	5e-5	0.0014

Progressive unfreezing is intended to limit catastrophic forgetting of the base model's Arabic speech representations while still allowing the full model to adapt to Quranic phonetics, tajweed rules, and recitation style.
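
The three-phase schedule in the table above can be sketched as follows. This is an illustration under assumptions, not the training script: `unfrozen_layers` and `apply_phase` are hypothetical names, layers are assumed to be 0-indexed (so the upper half of 18 layers is 9 to 17), and each layer is assumed to expose `parameters()` as a PyTorch module does. The decoder, which the table lists as trainable in every phase, is not handled here.

```python
def unfrozen_layers(phase, num_layers=18):
    """Encoder layer indices (0-indexed) that are trainable in a given phase."""
    if phase == 1:
        return range(num_layers - 3, num_layers)   # top 3 layers
    if phase == 2:
        return range(num_layers // 2, num_layers)  # upper half, e.g. 9..17
    if phase == 3:
        return range(num_layers)                   # all layers
    raise ValueError(f"unknown phase: {phase}")


def apply_phase(encoder_layers, phase):
    """Freeze or unfreeze every parameter of each encoder layer for a phase."""
    trainable = set(unfrozen_layers(phase, len(encoder_layers)))
    for idx, layer in enumerate(encoder_layers):
        for param in layer.parameters():
            param.requires_grad = idx in trainable
```

With a NeMo model, `apply_phase(model.encoder.layers, 1)` would presumably be called before building the optimizer for Phase 1, then repeated with the next phase number at each transition.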

Training data: tarteel-ai/everyayah — a diverse multi-reciter dataset of complete Quranic recitations at multiple audio qualities, covering all 114 surahs across dozens of reciters.

Usage

Installation

pip install "nemo_toolkit[asr]"

Offline transcription (recommended for files)

The .nemo checkpoint is saved with full bidirectional attention context, so transcribe() works out of the box with no extra configuration.

import nemo.collections.asr as nemo_asr

model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(
    "mohammed/fastconformer-quran-ar"
)
model.eval()

# Transcribe a .wav file (16kHz mono)
result = model.transcribe(["recitation.wav"])
print(result[0].text)
# e.g. "بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ"
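
Since the card specifies 16 kHz mono input, it can be useful to check a file's format before transcribing it. The helper below is a minimal sketch using only the standard library; `check_wav` is a hypothetical name, and it only handles PCM WAV files that Python's `wave` module can open.

```python
import wave


def check_wav(path, expected_rate=16000, expected_channels=1):
    """Return (ok, info) for a PCM WAV file without loading its samples."""
    with wave.open(path, "rb") as wav:
        info = {
            "sample_rate": wav.getframerate(),
            "channels": wav.getnchannels(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }
    ok = (
        info["sample_rate"] == expected_rate
        and info["channels"] == expected_channels
    )
    return ok, info
```

A file that fails the check can be resampled or downmixed with an audio tool of your choice before it is passed to `transcribe()`.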

Real-time streaming

The model supports cache-aware streaming inference via NeMo's cache_aware_stream_step(). The loading steps below must be applied in this order:

import torch
import nemo.collections.asr as nemo_asr
from omegaconf import OmegaConf, open_dict

model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(
    "mohammed/fastconformer-quran-ar"
)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Step 1: reset conv padding to symmetric (safety check before mode switch)
for layer in model.encoder.layers:
    if hasattr(layer, "conv") and hasattr(layer.conv, "conv"):
        conv = layer.conv.conv
        ks = conv.kernel_size[0] if isinstance(conv.kernel_size, tuple) else conv.kernel_size
        conv.padding = ((ks - 1) // 2, (ks - 1) // 2)

# Step 2: switch to causal local attention for streaming
model.change_attention_model(
    self_attention_model="rel_pos_local_attn",
    att_context_size=[128, 0],  # 128 frames lookback (~10s), fully causal
)
with open_dict(model.cfg):
    model.cfg.encoder.conv_context_size = "causal"

# Step 3: causal conv padding (pad only on the left)
for layer in model.encoder.layers:
    if hasattr(layer, "conv") and hasattr(layer.conv, "conv"):
        conv = layer.conv.conv
        ks = conv.kernel_size[0] if isinstance(conv.kernel_size, tuple) else conv.kernel_size
        conv.padding = (ks - 1, 0)

# Step 4 — greedy decoder
decoding_cfg = OmegaConf.structured(model.cfg.decoding)
OmegaConf.set_struct(decoding_cfg, False)
decoding_cfg.strategy = "greedy"
decoding_cfg.greedy.max_symbols = 10
decoding_cfg.greedy.use_cuda_graph_decoder = False # incompatible with streaming
model.change_decoding_strategy(decoding_cfg)
model.eval()

# Streaming step: feed 1600 ms chunks of 16 kHz mono audio
# (convert PCM int16 input to float samples first)
cache_last_channel, cache_last_time = None, None
chunk_samples = 16000 * 1600 // 1000  # 1600 ms chunk = 25600 samples

audio_chunk = torch.zeros(1, chunk_samples, device=device)  # replace with real audio
audio_len = torch.tensor([chunk_samples], device=device)

with torch.no_grad():
    processed, processed_len = model.preprocessor(
        input_signal=audio_chunk, length=audio_len
    )
    # The returned caches are passed back in when processing the next chunk
    encoded, encoded_len, cache_last_channel, cache_last_time, _ = (
        model.encoder.cache_aware_stream_step(
            processed_signal=processed,
            processed_signal_length=processed_len,
            cache_last_channel=cache_last_channel,
            cache_last_time=cache_last_time,
            keep_all_outputs=False,
        )
    )

For a complete streaming implementation with microphone input, silence detection, word callbacks, and a FastAPI WebSocket server, see the companion script in the repository files.
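
The streaming code above processes one chunk. A live pipeline also has to turn incoming PCM int16 bytes into fixed-size float chunks. The following is a minimal standard-library sketch of that step, not the companion script: `pcm16_to_float` and `iter_chunks` are hypothetical names, the input is assumed to be little-endian 16-bit mono PCM, and zero-padding the final chunk is one possible design choice.

```python
import array
import sys

SAMPLE_RATE = 16000
CHUNK_MS = 1600
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_MS // 1000  # 25600 samples per chunk


def pcm16_to_float(pcm_bytes):
    """Decode little-endian 16-bit PCM bytes into floats in [-1.0, 1.0)."""
    samples = array.array("h")
    samples.frombytes(pcm_bytes)
    if sys.byteorder == "big":
        samples.byteswap()  # input is assumed little-endian
    return [s / 32768.0 for s in samples]


def iter_chunks(samples, chunk_samples=CHUNK_SAMPLES):
    """Yield fixed-size chunks, zero-padding the final partial chunk."""
    for start in range(0, len(samples), chunk_samples):
        chunk = samples[start:start + chunk_samples]
        if len(chunk) < chunk_samples:
            chunk = chunk + [0.0] * (chunk_samples - len(chunk))
        yield chunk
```

Each yielded chunk could then be wrapped as `torch.tensor(chunk, device=device).unsqueeze(0)` and passed to `model.preprocessor`, carrying the encoder caches over from one chunk to the next.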

Qualitative Examples

The following validation-set references were transcribed word-for-word correctly by the model, including full diacritisation (tashkeel):

Reference	Predicted
وَهُوَ الَّذِي جَعَلَ لَكُمُ اللَّيْلَ لِبَاسًا وَالنَّوْمَ سُبَاتًا وَجَعَلَ النَّهَارَ نُشُورًا	✅ Perfect
الزَّانِي لَا يَنْكِحُ إِلَّا زَانِيَةً أَوْ مُشْرِكَةً وَالزَّانِيَةُ لَا يَنْكِحُهَا إِلَّا زَانٍ أَوْ مُشْرِكٌ وَحُرِّمَ ذَلِكَ عَلَى الْمُؤْمِنِينَ	✅ Perfect
إِلَّا مَنْ تَابَ وَآمَنَ وَعَمِلَ عَمَلًا صَالِحًا فَأُولَئِكَ يُبَدِّلُ اللَّهُ سَيِّئَاتِهِمْ حَسَنَاتٍ وَكَانَ اللَّهُ غَفُورًا رَحِيمًا	✅ Perfect
إِذْ قَالَ لِأَبِيهِ وَقَوْمِهِ مَا تَعْبُدُونَ	✅ Perfect
يَوْمَ لَا يَنْفَعُ مَالٌ وَلَا بَنُونَ	✅ Perfect
إِذْ قَالَ لَهُمْ أَخُوهُمْ هُودٌ أَلَا تَتَّقُونَ	✅ Perfect
أَتَبْنُونَ بِكُلِّ رِيعٍ آيَةً تَعْبَثُونَ	✅ Perfect
فَنَجَّيْنَاهُ وَأَهْلَهُ أَجْمَعِينَ	✅ Perfect
فَقَرَأَهُ عَلَيْهِمْ مَا كَانُوا بِهِ مُؤْمِنِينَ	✅ Perfect
وَأَنْذِرْ عَشِيرَتَكَ الْأَقْرَبِينَ	✅ Perfect
الَّذِينَ يُقِيمُونَ الصَّلَاةَ وَيُؤْتُونَ الزَّكَاةَ وَهُمْ بِالْآخِرَةِ هُمْ يُوقِنُونَ	✅ Perfect

These span several surahs (including An-Nur, Al-Furqan, and Ash-Shu'ara) and cover phonetically demanding material: long compound sentences, rare vocabulary (نُشُورًا، سُبَاتًا), emphatic consonants, and precise tashkeel on every word.
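
When a prediction is not an exact match, it helps to know whether only the tashkeel differs or the base letters differ too. The sketch below is one way to separate the two cases; `strip_tashkeel` and `classify_match` are hypothetical names, and treating every Unicode nonspacing mark (category `Mn`) as tashkeel is a simplifying assumption.

```python
import unicodedata


def strip_tashkeel(text):
    """Remove nonspacing combining marks such as fatha, kasra, shadda, sukun."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def classify_match(reference, predicted):
    """Return 'exact', 'tashkeel_only', or 'mismatch' for a pair of strings."""
    if reference == predicted:
        return "exact"
    if strip_tashkeel(reference) == strip_tashkeel(predicted):
        return "tashkeel_only"
    return "mismatch"
```

For example, two strings that share the same base letters but carry a different vowel mark on one letter are classified as `tashkeel_only`.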

Intended Use & Limitations

Intended use:

Quranic recitation transcription and verification
Tajweed learning applications
Ayah identification from audio
Recitation correction apps (compare hypothesis against reference ayah)
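
For the recitation-correction use case, comparing the hypothesis against the reference ayah can be sketched as a word-level diff. This is an illustrative approach using Python's `difflib`, not part of the model or its companion code; `diff_words` and the returned dictionary keys are hypothetical.

```python
import difflib


def diff_words(reference, hypothesis):
    """List word-level differences between a reference ayah and a hypothesis."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    matcher = difflib.SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    issues = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        issues.append({
            "type": tag,  # 'replace', 'delete' (word skipped), or 'insert' (extra word)
            "expected": ref_words[i1:i2],
            "heard": hyp_words[j1:j2],
        })
    return issues
```

An empty list means the recitation matched the reference word for word. Each entry otherwise reports which reference words were expected and what the model heard in their place.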

Limitations:

Optimised specifically for Quranic Arabic; performance on Modern Standard Arabic or dialectal Arabic is expected to be significantly lower than the base model's
Best results on clean, single-speaker recitation audio at 16 kHz
Streaming mode adds roughly 1.6 s of latency per chunk, because the encoder requires a minimum chunk size
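
The streaming figures in this card (a 1.6 s chunk and a 128-frame lookback of about 10 s) are consistent with an 80 ms encoder frame. The arithmetic below assumes a 10 ms feature hop and 8x subsampling, which are typical of NeMo FastConformer configurations but are not stated in this card.

```python
FEATURE_HOP_MS = 10     # assumed mel-spectrogram hop
SUBSAMPLING_FACTOR = 8  # assumed FastConformer subsampling factor
ENCODER_FRAME_MS = FEATURE_HOP_MS * SUBSAMPLING_FACTOR  # 80 ms per encoder frame


def encoder_frames(duration_ms):
    """Number of whole encoder frames covered by a span of audio."""
    return duration_ms // ENCODER_FRAME_MS


def lookback_seconds(att_left_context):
    """Audio duration covered by the left attention context."""
    return att_left_context * ENCODER_FRAME_MS / 1000


print(encoder_frames(1600))   # 20 encoder frames per 1600 ms chunk
print(lookback_seconds(128))  # 10.24 seconds of left context
```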

Citation

If you use this model, please cite it together with the dataset:

@misc{fastconformer-quran-ar,
 author = {Mohammed},
 title = {FastConformer Quran Arabic ASR},
 year = {2026},
 publisher = {Hugging Face},
 url = {https://huggingface.co/mohammed/fastconformer-quran-ar}
}

@misc{everyayah,
 author = {Tarteel AI},
 title = {EveryAyah: A Quranic Recitation Dataset},
 publisher = {Hugging Face},
 url = {https://huggingface.co/datasets/tarteel-ai/everyayah}
}

