Quran STT — ONNX Exports

ONNX-format exports of Muno459/fastconformer-quran, a FastConformer CTC model for automatic speech recognition of Quranic recitation with Tajweed diacritics. It achieves 2.9% WER on EveryAyah when diacritics are ignored (see Performance).

How the Model Works

Audio (16 kHz mono)
 → Log-Mel Spectrogram [80 bands, 10 ms frames]
 → FastConformer Encoder (114.6M params, 8× temporal subsampling)
 → Linear Projection → LogSoftmax → [T_out × 1025] log-probs
 → CTC Greedy Decode (argmax + blank collapse) → token IDs

Key details:

Encoder: FastConformer Large — convolution-augmented Transformer. Each output frame covers 80 ms of audio.
CTC Head: Linear projection + LogSoftmax over 1025 classes (1024 SentencePiece BPE tokens + 1 blank).
Decoder: CTC greedy decoding — frame-independent, no language model, preserves mispronunciations.
Tokenizer: SentencePiece BPE, vocabulary size 1024, trained on Quranic Arabic with Tajweed diacritics.
Fine-tuned from NVIDIA's stt_en_fastconformer_hybrid_large_pc on EveryAyah + tlog.
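
The timing figures above imply a simple relation between clip length, feature frames, and encoder output frames. The following is a minimal sketch of that arithmetic, assuming a 160-sample hop (10 ms at 16 kHz) and ceiling rounding at each stage; the exported encoder may round differently by a frame, and the function names are illustrative rather than part of this repository.

```python
import math

SAMPLE_RATE = 16_000   # Hz, from the pipeline description
HOP_SAMPLES = 160      # 10 ms feature hop at 16 kHz
SUBSAMPLING = 8        # encoder temporal subsampling factor
OUTPUT_HOP_S = HOP_SAMPLES * SUBSAMPLING / SAMPLE_RATE  # seconds per output frame


def approx_output_frames(num_samples: int) -> int:
    """Approximate encoder output frame count (assumes ceiling rounding)."""
    feature_frames = math.ceil(num_samples / HOP_SAMPLES)
    return math.ceil(feature_frames / SUBSAMPLING)


def frame_to_seconds(frame_index: int) -> float:
    """Start time of an output frame, in seconds."""
    return frame_index * OUTPUT_HOP_S


# A 5-second clip: 80,000 samples -> 500 feature frames -> 63 output frames
print(approx_output_frames(5 * SAMPLE_RATE))
print(frame_to_seconds(25))
```

The same 80 ms factor is what converts alignment frame indices into word timestamps later in the pipeline.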

Files

Path	Description
`onnx/model_fp32.onnx`	CTC-only ONNX, float32 (437 MB)
`onnx/model_fp16.onnx`	CTC-only ONNX, float16 (219 MB)
`onnx/model_int8.onnx`	CTC-only ONNX, int8 quantized (167 MB)
`tokenizer.model`	SentencePiece Unigram model (vocab=1024) — natively used by NeMo; works with Python `sentencepiece` or `@sctg/sentencepiece-js` (WASM) for web
`tokenizer.json`	HuggingFace tokenizer format — for `@huggingface/tokenizers` / `transformers.js`
`tokens.txt`	Token ID to text mapping
`model_config.yaml`	NeMo model configuration
`head/pronunciation_head.pt`	Pronunciation scoring head (5.2 MB)
`tajweed/`	Python modules: aligner, scorer, rules, phonology
`demo/`	Sample clips (Alafasy, Basfar) with expected transcriptions
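
As a rough sanity check on the file sizes listed above, the raw weight storage can be estimated from the parameter count. This sketch assumes the 114.6M figure covers essentially all weights and that "MB" in the table means MiB (2^20 bytes); both are assumptions, not statements from the export scripts.

```python
PARAMS = 114.6e6  # encoder parameter count quoted in the pipeline description


def size_mib(num_params: float, bytes_per_param: int) -> float:
    """Raw weight storage in MiB, ignoring graph metadata and unquantized tensors."""
    return num_params * bytes_per_param / 2**20


for name, width in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    print(f"{name}: about {size_mib(PARAMS, width):.0f} MiB")
```

Under these assumptions the fp32 and fp16 estimates line up with the listed 437 MB and 219 MB. The int8 file is larger than one byte per parameter would suggest, which would be consistent with some tensors remaining in higher precision, although the quantization settings are not stated here.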

Full Pipeline (as used in hifz-test)

The hifz-test project uses this model to evaluate Quran recitation — given a user's recording and a reference ayah, it produces per-word verdicts (correct, warning, wrong) with GOP pronunciation scores.

Here is the complete pipeline:

import numpy as np
import onnxruntime as ort
import sentencepiece as spm
import soundfile as sf
import librosa
import re

# ── 1. Load model & tokenizer ──────────────────────────────────
session = ort.InferenceSession("onnx/model_int8.onnx")
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
BLANK_ID = 1024
OUTPUT_HOP_S = 0.080 # 80 ms per output frame

# ── 2. Load audio (16 kHz mono) ────────────────────────────────
wav, sr = librosa.load("user_recording.wav", sr=16000, mono=True)

# ── 3. Extract log-mel features ────────────────────────────────
# See tajweed/aligner.py or hifz-test/mel.js for full implementation
def log_mel_extract(audio, sr=16000):
    n_fft, win_len, hop_len, n_mels = 512, 400, 160, 80
    window = np.hanning(win_len)
    pad = np.pad(audio, (0, win_len - 1))
    frames = 1 + (len(pad) - win_len) // hop_len
    # Short-time Fourier transform, one column per 10 ms frame
    stft = np.zeros((n_fft // 2 + 1, frames), dtype=np.complex64)
    for t in range(frames):
        s = pad[t * hop_len:t * hop_len + win_len] * window
        stft[:, t] = np.fft.rfft(s, n=n_fft)
    power = np.abs(stft) ** 2
    # Triangular mel filterbank spanning 0 to 8000 Hz
    mel_pts = np.linspace(0, 2595 * np.log10(1 + 8000 / 700), n_mels + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        for k in range(bins[m - 1], bins[m]):
            fb[m - 1, k] = (k - bins[m - 1]) / (bins[m] - bins[m - 1])
        for k in range(bins[m], bins[m + 1]):
            fb[m - 1, k] = (bins[m + 1] - k) / (bins[m + 1] - bins[m])
    mel = np.log(fb @ power + 2 ** -24)
    # Per-band mean/variance normalization over time
    mel = (mel - mel.mean(axis=1, keepdims=True)) / (mel.std(axis=1, keepdims=True) + 1e-5)
    return mel.astype(np.float32)  # (80, T)

features = log_mel_extract(wav)
features = features[None, ...] # (1, 80, T)
length = np.array([features.shape[2]], dtype=np.int64)

# ── 4. ONNX Inference ──────────────────────────────────────────
logprobs = session.run(["logprobs"], {"audio_signal": features, "length": length})[0][0]
# logprobs shape: (T_out, 1025) where T_out ≈ T_feat / 8

# ── 5. Greedy CTC decode ───────────────────────────────────────
decoded = []
prev = BLANK_ID
for t in range(logprobs.shape[0]):
    curr = int(np.argmax(logprobs[t]))
    if curr != prev and curr != BLANK_ID:
        decoded.append(curr)
    prev = curr

asr_text = sp.decode_ids(decoded)
print("ASR output:", asr_text)

# ── 6. Reference ayah text ─────────────────────────────────────
ref_text = "قُلْ هُوَ اللَّهُ أَحَدٌ" # Q 112:1
ref_words = re.sub(r"[^؀-ۿ\s]", "", ref_text).split()

# Normalize reference text to match tokenizer expectations
CLEAN_TABLE = str.maketrans({chr(c): '' for c in range(0x064B, 0x0653)})
ref_clean = ref_text.translate(CLEAN_TABLE)
ref_ids = sp.encode(ref_clean, out_type=int)

# ── 7. Fitting alignment (Needleman-Wunsch with free prefix gap) ──
# Finds where decoded tokens best match reference tokens
def fitting_align(decoded, reference):
    n, m = len(decoded), len(reference)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1): dp[i][0] = i * -1
    for j in range(m + 1): dp[0][j] = 0  # skipping a reference prefix is free
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = 2 if decoded[i - 1] == reference[j - 1] else -1
            dp[i][j] = max(dp[i - 1][j - 1] + s, dp[i - 1][j] - 1, dp[i][j - 1] - 1)
    # Best end position in the reference (trailing reference tokens are free too)
    best_j = max(range(1, m + 1), key=lambda j: dp[n][j])
    i, j = n, best_j
    ref_indices = []
    # Trace back, collecting the reference indices that matched exactly
    while i > 0:
        s = 2 if j > 0 and decoded[i - 1] == reference[j - 1] else -1
        if j > 0 and dp[i][j] == dp[i - 1][j - 1] + s:
            if s == 2: ref_indices.append(j - 1)
            i -= 1; j -= 1
        elif dp[i][j] == dp[i - 1][j] - 1:
            i -= 1
        elif j > 0 and dp[i][j] == dp[i][j - 1] - 1:
            j -= 1
        else:
            i -= 1
    if not ref_indices: return 0, -1
    ref_indices.reverse()
    return ref_indices[0], ref_indices[-1]

first_match, last_match = fitting_align(decoded, ref_ids)

# fitting_align returns (0, -1) when nothing matched, so test last_match
if last_match >= 0:
    # Snap to word boundaries (SentencePiece ▁ prefix)
    while first_match > 0 and not sp.id_to_piece(ref_ids[first_match - 1]).startswith("▁"):
        first_match -= 1
    while last_match + 1 < len(ref_ids) and not sp.id_to_piece(ref_ids[last_match + 1]).startswith("▁"):
        last_match += 1
    token_ids = ref_ids[first_match:last_match + 1]
else:
    token_ids = ref_ids  # fallback to full reference

# ── 8. Viterbi CTC Forced Alignment ────────────────────────────
# Aligns each reference token to specific output frames
def ctc_forced_align(logprobs, token_ids, blank_id=1024):
    T, V = logprobs.shape
    # Interleave blanks: [blank, tok0, blank, tok1, ..., blank]
    seq = [blank_id]
    for t in token_ids:
        seq.append(t); seq.append(blank_id)
    S = len(seq)
    neg_inf = -1e18
    alpha = np.full((T, S), neg_inf)
    back = np.zeros((T, S), dtype=np.int16)
    alpha[0, 0] = logprobs[0, seq[0]]
    if S > 1: alpha[0, 1] = logprobs[0, seq[1]]
    # A blank may be skipped only between two different tokens
    skip_ok = np.zeros(S, dtype=bool)
    for s in range(2, S):
        skip_ok[s] = (seq[s] != blank_id) and (seq[s] != seq[s - 2])
    for t in range(1, T):
        for s in range(S):
            v0 = alpha[t - 1, s]
            v1 = alpha[t - 1, s - 1] if s > 0 else neg_inf
            v2 = alpha[t - 1, s - 2] if s >= 2 and skip_ok[s] else neg_inf
            best = max(enumerate([v0, v1, v2]), key=lambda x: x[1])
            alpha[t, s] = best[1] + logprobs[t, seq[s]]
            back[t, s] = -best[0]
    s = S - 1 if S < 2 or alpha[T - 1, S - 2] < alpha[T - 1, S - 1] else S - 2
    path = [s]
    for t in range(T - 1, 0, -1):
        s = s + int(back[t, s])
        path.append(s)
    path.reverse()
    # Convert the state path into (start_frame, end_frame) per token.
    # The start frame is tracked separately from the token index.
    intervals = []
    cur_tok, cur_start = -1, 0
    for t, s in enumerate(path):
        if seq[s] == blank_id: continue
        tok_idx = (s - 1) // 2
        if tok_idx != cur_tok:
            if cur_tok >= 0: intervals.append((cur_start, t))
            cur_tok, cur_start = tok_idx, t
    if cur_tok >= 0: intervals.append((cur_start, T))
    while len(intervals) < len(token_ids):
        intervals.append((intervals[-1][1], intervals[-1][1]) if intervals else (0, 0))
    return intervals[:len(token_ids)]

intervals = ctc_forced_align(logprobs, token_ids)

# ── 9. Segment into words & score each word ────────────────────
word_tokens = []
current = []
for tok_id, (a, b) in zip(token_ids, intervals):
    piece = sp.id_to_piece(tok_id)
    if current and piece.startswith("▁"):
        word_tokens.append(current)
        current = []
    current.append((tok_id, a, b))
if current:
    word_tokens.append(current)

results = []
for i, tokens in enumerate(word_tokens):
    if i >= len(ref_words):
        break
    ref_word = ref_words[i]

    # Compute GOP per token
    gop_norms = []
    detected_pieces = []
    for tok_id, a, b in tokens:
        window = logprobs[max(0, a):b] if b > a else logprobs[a:a + 1]
        if len(window) == 0: continue
        expected = float(window[:, tok_id].max())
        frame = window[int(np.argmax(window[:, tok_id]))].copy()
        frame[BLANK_ID] = -np.inf
        top_lp = float(frame.max()) if frame.size > 0 else -100.0
        gop_norms.append(expected - top_lp)
        detected_pieces.append(sp.id_to_piece(int(np.argmax(frame))).lstrip("▁"))

    gop_avg = float(np.mean(gop_norms)) if gop_norms else -100.0
    gop_min = float(np.min(gop_norms)) if gop_norms else -100.0
    detected_text = "".join(detected_pieces)

    # Classify: position-wise character similarity with harakat removed
    harakat = re.compile(r"[\u064B-\u0652]")
    ref_bare = harakat.sub("", ref_word)
    det_bare = harakat.sub("", detected_text)
    if ref_bare and det_bare:
        matches = sum(1 for x, y in zip(ref_bare, det_bare) if x == y)
        sim = matches / max(len(ref_bare), len(det_bare))
    else:
        sim = 0.0

    if not detected_text or sim < 0.70:
        status = "wrong"
    elif gop_min < -2.0:
        status = "warning"
    else:
        status = "correct"

    results.append({
        "reference": ref_word,
        "detected": detected_text,
        "start_s": round(tokens[0][1] * OUTPUT_HOP_S, 3),
        "end_s": round(tokens[-1][2] * OUTPUT_HOP_S, 3),
        "gop_avg": round(gop_avg, 3),
        "gop_min": round(gop_min, 3),
        "status": status,
    })

# ── 10. Results ────────────────────────────────────────────────
summary = {
    "correct": sum(1 for r in results if r["status"] == "correct"),
    "warning": sum(1 for r in results if r["status"] == "warning"),
    "wrong": sum(1 for r in results if r["status"] == "wrong"),
}
print(f"\nSummary: {summary['correct']}/{len(results)} correct, "
      f"{summary['warning']} warnings, {summary['wrong']} wrong\n")
for r in results:
    print(f" {r['reference']:<20} → {r['detected']:<20} "
          f"gop={r['gop_min']:.2f} {r['status']}")

This mirrors the pipeline in hifz_pipeline.py:HifzScorer.run().

Simplified Usage

Basic ASR

import numpy as np
import onnxruntime as ort
import sentencepiece as spm
import librosa

session = ort.InferenceSession("onnx/model_fp16.onnx")
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

wav, _ = librosa.load("clip.wav", sr=16000, mono=True)

# Extract features with log_mel_extract as defined in the Full Pipeline section
# (see tajweed/aligner.py for the exact implementation)
features = log_mel_extract(wav)[None, ...].astype(np.float32)

logprobs = session.run(["logprobs"], {
 "audio_signal": features,
 "length": np.array([features.shape[2]], dtype=np.int64),
})[0][0] # (T_out, 1025)

# Greedy decode
ids = []
prev = 1024
for t in range(logprobs.shape[0]):
    best = logprobs[t].argmax()
    if best != 1024 and best != prev:
        ids.append(int(best))
    prev = int(best)

print(sp.decode_ids(ids))
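
The greedy decode loop above (argmax per frame, collapse repeats, drop blanks) can also be written without a Python loop. The following is a sketch of an equivalent vectorized form; `greedy_ctc_decode` is an illustrative name, not a function shipped in this repository, and the toy matrix uses a five-class vocabulary only to keep the example small.

```python
import numpy as np


def greedy_ctc_decode(logprobs: np.ndarray, blank_id: int = 1024) -> list[int]:
    """Argmax per frame, collapse consecutive repeats, then drop blanks."""
    best = logprobs.argmax(axis=1)
    # Keep a frame only if its label differs from the previous frame's label
    keep = np.ones(len(best), dtype=bool)
    keep[1:] = best[1:] != best[:-1]
    # Blank frames separate repeated tokens but are never emitted themselves
    keep &= best != blank_id
    return best[keep].tolist()


# Toy example: 4 tokens plus blank (id 4), frame labels 4,1,1,4,1,2,2,4
labels = [4, 1, 1, 4, 1, 2, 2, 4]
toy = np.full((len(labels), 5), -10.0)
toy[np.arange(len(labels)), labels] = 0.0
print(greedy_ctc_decode(toy, blank_id=4))
```

The blank between the two runs of token 1 is what lets the same token be emitted twice in a row.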

Using the tajweed aligner

from tajweed.aligner import CTCAligner

aligner = CTCAligner(
    model_path="onnx/model_fp16.onnx",
    tokenizer_path="tokenizer.model",
)
logprobs = aligner.transcribe(audio_waveform)
tokens = aligner.decode(logprobs)
print(tokens)

Web (dev mode)

import { Tokenizer } from 'https://cdn.jsdelivr.net/npm/@huggingface/tokenizers@0.1.3/+esm';

// Load tokenizer (8.3 kB gzip, pure JS, no WASM)
const tokRes = await fetch('https://huggingface.co/Saboorhsn/quran-stt-onnx/resolve/main/tokenizer.json');
const tokenizer = new Tokenizer(await tokRes.json(), {
  unk_token: '<unk>', bos_token: '<s>', eos_token: '</s>'
});

const encoded = tokenizer.encode('بِسْمِ اللَّهِ');
const decoded = tokenizer.decode(encoded.ids);

// Load ONNX model
const response = await fetch(
  'https://huggingface.co/Saboorhsn/quran-stt-onnx/resolve/main/onnx/model_fp16.onnx'
);
const session = await (await import('onnxruntime-web'))
  .InferenceSession.create(await response.arrayBuffer(), {
    executionProviders: ['wasm']
  });

For production Android, bundle model_int8.onnx in the APK.

Performance

Metric	Value
WER (loose, no diacritics)	2.9 %
CER (strict, with diacritics)	2.7 %
WER (strict, with diacritics)	17.5 %
WER (zero-shot, 30 unseen qaris)	23.0 %
RTF (Python, CPU)	0.04
RTF (WASM SIMD, x86)	0.22
RTF (Android native)	~0.04
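
The gap between the loose and strict WER rows comes from how diacritics are treated before scoring. The evaluation script behind these numbers is not shown here, so the following is only a sketch of the distinction: it assumes "loose" means harakat in the range U+064B to U+0652 are stripped from both strings before a standard word-level edit distance. The helper names are illustrative.

```python
import re

# Assumption: the same harakat range the scoring code uses (fathatan through sukun)
HARAKAT = re.compile(r"[\u064B-\u0652]")


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance over word lists (substitution, insertion, deletion cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def wer(ref: str, hyp: str, strip_diacritics: bool = False) -> float:
    """Word error rate; with strip_diacritics=True, harakat are ignored."""
    if strip_diacritics:
        ref, hyp = HARAKAT.sub("", ref), HARAKAT.sub("", hyp)
    ref_words, hyp_words = ref.split(), hyp.split()
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)


ref = "قُلْ هُوَ اللَّهُ أَحَدٌ"
# Hypothetical recognition error: fatha on the waw replaced by kasra
hyp = ref.replace("\u0648\u064E", "\u0648\u0650")
print("strict:", wer(ref, hyp))
print("loose:", wer(ref, hyp, strip_diacritics=True))
```

A single wrong vowel mark counts as a full word error under the strict metric and as no error under the loose one, which is why strict WER is much higher than strict CER in the table.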

Credits

Original model — Muno459/fastconformer-quran
Base architecture — NVIDIA FastConformer via NeMo
Pre-trained weights — nvidia/stt_en_fastconformer_hybrid_large_pc
Datasets — EveryAyah · tlog
ONNX export & quantization — Saboor Hsn · Mualim.app

Mualim-Quran — Quran tutoring app using this pipeline
hadith-api-toon — CDN-optimized Hadith API (68K hadiths, 8 languages)

License

Apache 2.0
