Quran STT — ONNX Exports
ONNX-format exports of Muno459/fastconformer-quran, a FastConformer CTC model for automatic speech recognition of Quranic recitation with Tajweed diacritics. It reaches 2.9% WER on EveryAyah when diacritics are ignored (see Performance).
How the Model Works
Audio (16 kHz mono)
→ Log-Mel Spectrogram [80 bands, 10 ms frames]
→ FastConformer Encoder (114.6M params, 8× temporal subsampling)
→ Linear Projection → LogSoftmax → [T_out × 1025] log-probs
→ CTC Greedy Decode (argmax + blank collapse) → token IDs
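As a rough sketch of the shapes this diagram implies, the arithmetic below estimates how many mel frames and encoder output frames a clip produces. It assumes a 160-sample hop (10 ms at 16 kHz) and ceiling division for the 8× subsampling; the function name and the rounding are illustrative assumptions, so the real export may differ by a frame at the boundaries.

```python
import math

SAMPLE_RATE = 16_000   # Hz, input audio rate
HOP_SAMPLES = 160      # 10 ms mel frames at 16 kHz
SUBSAMPLING = 8        # encoder temporal subsampling

def estimate_frames(duration_s):
    """Return (approx. mel frames, approx. output frames) for a clip."""
    n_samples = int(round(duration_s * SAMPLE_RATE))
    mel_frames = 1 + n_samples // HOP_SAMPLES
    # Assumption: the encoder rounds up when subsampling by 8.
    out_frames = math.ceil(mel_frames / SUBSAMPLING)
    return mel_frames, out_frames

mel_frames, out_frames = estimate_frames(4.0)
print(mel_frames, out_frames)  # 401 51
```

Each output frame therefore spans 8 × 10 ms = 80 ms, which is the figure used when converting frame indices to seconds.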
Key details:
- Encoder: FastConformer Large — convolution-augmented Transformer. Each output frame covers 80 ms of audio.
- CTC Head: Linear projection + LogSoftmax over 1025 classes (1024 SentencePiece BPE tokens + 1 blank).
- Decoder: CTC greedy decoding — frame-independent, no language model, preserves mispronunciations.
- Tokenizer: SentencePiece BPE, vocabulary size 1024, trained on Quranic Arabic with Tajweed diacritics.
- Fine-tuned from NVIDIA's stt_en_fastconformer_hybrid_large_pc on EveryAyah + tlog.
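To make "argmax + blank collapse" concrete, here is a minimal toy sketch of CTC greedy collapsing on hand-written per-frame IDs. The helper name `ctc_greedy_collapse` is hypothetical; the blank ID of 1024 matches the 1025-class layout described above.

```python
import numpy as np

BLANK_ID = 1024  # last of the 1025 classes

def ctc_greedy_collapse(frame_ids, blank_id=BLANK_ID):
    """Merge consecutive repeats, then drop blanks."""
    out, prev = [], blank_id
    for tok in frame_ids:
        tok = int(tok)
        if tok != prev and tok != blank_id:
            out.append(tok)
        prev = tok
    return out

# Toy per-frame argmax IDs: token 7 held for two frames, a blank,
# token 7 again (a genuine repeat), then token 42 held for two frames.
frames = np.array([7, 7, 1024, 7, 42, 42, 1024])
print(ctc_greedy_collapse(frames))  # [7, 7, 42]
```

The blank between the two runs of token 7 is what lets a truly repeated token survive the collapse.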
Files
| Path | Description |
|---|---|
| `onnx/model_fp32.onnx` | CTC-only ONNX, float32 (437 MB) |
| `onnx/model_fp16.onnx` | CTC-only ONNX, float16 (219 MB) |
| `onnx/model_int8.onnx` | CTC-only ONNX, int8 quantized (167 MB) |
| `tokenizer.model` | SentencePiece Unigram model (vocab=1024); natively used by NeMo; works with Python sentencepiece or @sctg/sentencepiece-js (WASM) for web |
| `tokenizer.json` | HuggingFace tokenizer format, for @huggingface/tokenizers / transformers.js |
| `tokens.txt` | Token ID to text mapping |
| `model_config.yaml` | NeMo model configuration |
| `head/pronunciation_head.pt` | Pronunciation scoring head (5.2 MB) |
| `tajweed/` | Python modules: aligner, scorer, rules, phonology |
| `demo/` | Sample clips (Alafasy, Basfar) with expected transcriptions |
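As a small convenience sketch, the helpers below build direct-download URLs for these files and pick an export by size budget. The URL pattern and repo name are copied from the Web example further down; the helper names and the "largest that fits" policy are assumptions for illustration only.

```python
REPO_ID = "Saboorhsn/quran-stt-onnx"  # repo used in the Web example URLs

# Sizes in MB, copied from the table above.
MODEL_SIZES_MB = {
    "onnx/model_fp32.onnx": 437,
    "onnx/model_fp16.onnx": 219,
    "onnx/model_int8.onnx": 167,
}

def resolve_url(path, repo_id=REPO_ID, revision="main"):
    """Build a direct-download URL for a file in the repo."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{path}"

def largest_within(budget_mb):
    """Pick the largest export that fits the budget, or None if none fits."""
    fitting = [p for p, mb in MODEL_SIZES_MB.items() if mb <= budget_mb]
    return max(fitting, key=MODEL_SIZES_MB.get) if fitting else None

choice = largest_within(250)
print(choice)  # onnx/model_fp16.onnx
print(resolve_url(choice))
```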
Full Pipeline (as used in hifz-test)
The hifz-test project uses this model to evaluate Quran recitation — given a user's recording and a reference ayah, it produces per-word verdicts (correct, warning, wrong) with GOP pronunciation scores.
Here is the complete pipeline:
import numpy as np
import onnxruntime as ort
import sentencepiece as spm
import soundfile as sf
import librosa
import re
# ── 1. Load model & tokenizer ──────────────────────────────────
session = ort.InferenceSession("onnx/model_int8.onnx")
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
BLANK_ID = 1024
OUTPUT_HOP_S = 0.080 # 80 ms per output frame
# ── 2. Load audio (16 kHz mono) ────────────────────────────────
wav, sr = librosa.load("user_recording.wav", sr=16000, mono=True)
# ── 3. Extract log-mel features ────────────────────────────────
# See tajweed/aligner.py or hifz-test/mel.js for full implementation
def log_mel_extract(audio, sr=16000):
    n_fft, win_len, hop_len, n_mels = 512, 400, 160, 80
    window = np.hanning(win_len)
    pad = np.pad(audio, (0, win_len - 1))
    frames = 1 + (len(pad) - win_len) // hop_len
    # Short-time Fourier transform: one Hann-windowed frame per column
    stft = np.zeros((n_fft // 2 + 1, frames), dtype=np.complex64)
    for t in range(frames):
        s = pad[t * hop_len:t * hop_len + win_len] * window
        stft[:, t] = np.fft.rfft(s, n=n_fft)
    power = np.abs(stft) ** 2
    # Triangular mel filterbank spanning 0 to 8000 Hz
    mel_pts = np.linspace(0, 2595 * np.log10(1 + 8000 / 700), n_mels + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        for k in range(bins[m - 1], bins[m]):
            fb[m - 1, k] = (k - bins[m - 1]) / (bins[m] - bins[m - 1])
        for k in range(bins[m], bins[m + 1]):
            fb[m - 1, k] = (bins[m + 1] - k) / (bins[m + 1] - bins[m])
    # Log compression, then per-band mean/variance normalization
    mel = np.log(fb @ power + 2 ** -24)
    mel = (mel - mel.mean(axis=1, keepdims=True)) / (mel.std(axis=1, keepdims=True) + 1e-5)
    return mel.astype(np.float32)  # (80, T)
features = log_mel_extract(wav)
features = features[None, ...] # (1, 80, T)
length = np.array([features.shape[2]], dtype=np.int64)
# ── 4. ONNX Inference ──────────────────────────────────────────
logprobs = session.run(["logprobs"], {"audio_signal": features, "length": length})[0][0]
# logprobs shape: (T_out, 1025) where T_out ≈ T_feat / 8
# ── 5. Greedy CTC decode ───────────────────────────────────────
decoded = []
prev = BLANK_ID
for t in range(logprobs.shape[0]):
    curr = int(np.argmax(logprobs[t]))
    if curr != prev and curr != BLANK_ID:
        decoded.append(curr)
    prev = curr
asr_text = sp.decode_ids(decoded)
print("ASR output:", asr_text)
# ── 6. Reference ayah text ─────────────────────────────────────
ref_text = "قُلْ هُوَ اللَّهُ أَحَدٌ" # Q 112:1
ref_words = re.sub(r"[^\u0600-\u06FF\s]", "", ref_text).split()  # keep only Arabic-block characters and whitespace
# Normalize reference text to match tokenizer expectations
CLEAN_TABLE = str.maketrans({chr(c): '' for c in range(0x064B, 0x0653)})
ref_clean = ref_text.translate(CLEAN_TABLE)
ref_ids = sp.encode(ref_clean, out_type=int)
# ── 7. Fitting alignment (Needleman-Wunsch with free prefix gap) ──
# Finds where decoded tokens best match reference tokens
def fitting_align(decoded, reference):
    n, m = len(decoded), len(reference)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1): dp[i][0] = i * -1  # unmatched decoded tokens cost -1 each
    for j in range(m + 1): dp[0][j] = 0       # skipping a reference prefix is free
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = 2 if decoded[i - 1] == reference[j - 1] else -1
            dp[i][j] = max(dp[i - 1][j - 1] + s, dp[i - 1][j] - 1, dp[i][j - 1] - 1)
    # The alignment may end anywhere in the reference: take the best last-row cell
    best_j = max(range(1, m + 1), key=lambda j: dp[n][j])
    i, j = n, best_j
    ref_indices = []
    while i > 0:
        s = 2 if j > 0 and decoded[i - 1] == reference[j - 1] else -1
        if j > 0 and dp[i][j] == dp[i - 1][j - 1] + s:
            if s == 2:
                ref_indices.append(j - 1)
            i -= 1; j -= 1
        elif dp[i][j] == dp[i - 1][j] - 1:
            i -= 1
        elif j > 0 and dp[i][j] == dp[i][j - 1] - 1:
            j -= 1
        else:
            i -= 1
    if not ref_indices:
        return 0, -1
    ref_indices.reverse()
    return ref_indices[0], ref_indices[-1]
first_match, last_match = fitting_align(decoded, ref_ids)
if last_match >= first_match:  # fitting_align returns (0, -1) when nothing matched
    # Snap to word boundaries (SentencePiece ▁ prefix)
    # Move the start back until it sits on a word-initial piece
    while first_match > 0 and not sp.id_to_piece(ref_ids[first_match]).startswith("▁"):
        first_match -= 1
    # Extend the end until the next piece starts a new word
    while last_match + 1 < len(ref_ids) and not sp.id_to_piece(ref_ids[last_match + 1]).startswith("▁"):
        last_match += 1
    token_ids = ref_ids[first_match:last_match + 1]
else:
    token_ids = ref_ids  # fallback to full reference
# ── 8. Viterbi CTC Forced Alignment ────────────────────────────
# Aligns each reference token to specific output frames
def ctc_forced_align(logprobs, token_ids, blank_id=1024):
    T, V = logprobs.shape
    # Interleave blanks: [blank, t1, blank, t2, ..., blank]
    seq = [blank_id]
    for t in token_ids:
        seq.append(t); seq.append(blank_id)
    S = len(seq)
    neg_inf = -1e18
    alpha = np.full((T, S), neg_inf)
    back = np.zeros((T, S), dtype=np.int16)
    alpha[0, 0] = logprobs[0, seq[0]]
    if S > 1: alpha[0, 1] = logprobs[0, seq[1]]
    # The blank between two tokens may be skipped only if the tokens differ
    skip_ok = np.zeros(S, dtype=bool)
    for s in range(2, S):
        skip_ok[s] = (seq[s] != blank_id) and (seq[s] != seq[s - 2])
    for t in range(1, T):
        for s in range(S):
            v0 = alpha[t - 1, s]
            v1 = alpha[t - 1, s - 1] if s > 0 else neg_inf
            v2 = alpha[t - 1, s - 2] if s >= 2 and skip_ok[s] else neg_inf
            best = max(enumerate([v0, v1, v2]), key=lambda x: x[1])
            alpha[t, s] = best[1] + logprobs[t, seq[s]]
            back[t, s] = -best[0]
    # Backtrace from the better of the two valid end states
    s = S - 1 if S < 2 or alpha[T - 1, S - 2] < alpha[T - 1, S - 1] else S - 2
    path = [s]
    for t in range(T - 1, 0, -1):
        s = int(s + back[t, s])
        path.append(s)
    path.reverse()
    # Collect one (start_frame, end_frame) interval per token.
    # The token index and its start frame are tracked separately so that
    # the intervals hold frame indices, as the scoring step expects.
    intervals = []
    cur_tok, cur_start = -1, 0
    for t, s in enumerate(path):
        if seq[s] == blank_id: continue
        tok_idx = (s - 1) // 2
        if tok_idx != cur_tok:
            if cur_tok >= 0: intervals.append((cur_start, t))
            cur_tok, cur_start = tok_idx, t
    if cur_tok >= 0: intervals.append((cur_start, T))
    # Pad with empty intervals if the path did not reach every token
    while len(intervals) < len(token_ids):
        intervals.append((intervals[-1][1], intervals[-1][1]) if intervals else (0, 0))
    return intervals[:len(token_ids)]
intervals = ctc_forced_align(logprobs, token_ids)
# ── 9. Segment into words & score each word ────────────────────
word_tokens = []
current = []
for tok_id, (a, b) in zip(token_ids, intervals):
    piece = sp.id_to_piece(tok_id)
    if current and piece.startswith("▁"):
        word_tokens.append(current)
        current = []
    current.append((tok_id, a, b))
if current:
    word_tokens.append(current)
harakat = re.compile(r"[\u064B-\u0652]")
def char_similarity(s1, s2):
    # Fraction of aligned positions that agree, relative to the longer string
    if not s1 or not s2:
        return 0.0
    return sum(1 for a, b in zip(s1, s2) if a == b) / max(len(s1), len(s2))
results = []
for i, tokens in enumerate(word_tokens):
    if i >= len(ref_words):
        break
    ref_word = ref_words[i]
    # Compute GOP per token
    gop_norms = []
    detected_pieces = []
    for tok_id, a, b in tokens:
        window = logprobs[max(0, a):b] if b > a else logprobs[a:a + 1]
        if len(window) == 0: continue
        # Expected token's best log-prob vs. the best non-blank class at that frame
        expected = float(window[:, tok_id].max())
        frame = window[int(np.argmax(window[:, tok_id]))].copy()
        frame[BLANK_ID] = -np.inf
        top_lp = float(frame.max()) if frame.size > 0 else -100.0
        gop_norms.append(expected - top_lp)
        detected_pieces.append(sp.id_to_piece(int(np.argmax(frame))).lstrip("▁"))
    gop_avg = float(np.mean(gop_norms)) if gop_norms else -100.0
    gop_min = float(np.min(gop_norms)) if gop_norms else -100.0
    detected_text = "".join(detected_pieces)
    # Classify
    sim = char_similarity(harakat.sub("", ref_word), harakat.sub("", detected_text))
    if not detected_text or sim < 0.70:
        status = "wrong"
    elif gop_min < -2.0:
        status = "warning"
    else:
        status = "correct"
    results.append({
        "reference": ref_word,
        "detected": detected_text,
        "start_s": round(tokens[0][1] * OUTPUT_HOP_S, 3),
        "end_s": round(tokens[-1][2] * OUTPUT_HOP_S, 3),
        "gop_avg": round(gop_avg, 3),
        "gop_min": round(gop_min, 3),
        "status": status,
    })
# ── 10. Results ────────────────────────────────────────────────
summary = {
    "correct": sum(1 for r in results if r["status"] == "correct"),
    "warning": sum(1 for r in results if r["status"] == "warning"),
    "wrong": sum(1 for r in results if r["status"] == "wrong"),
}
print(f"\nSummary: {summary['correct']}/{len(results)} correct, "
      f"{summary['warning']} warnings, {summary['wrong']} wrong\n")
for r in results:
    print(f" {r['reference']:<20} → {r['detected']:<20} "
          f"gop={r['gop_min']:.2f} {r['status']}")
This mirrors the pipeline in hifz_pipeline.py:HifzScorer.run().
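To isolate the scoring idea from step 9, here is a toy sketch of the per-token GOP computation on a hand-made two-frame window. The function name `token_gop` and the four-class toy vocabulary are assumptions for illustration; the logic follows the pipeline above (expected token's peak log-prob minus the best non-blank log-prob at the same frame).

```python
import numpy as np

def token_gop(window, tok_id, blank_id):
    """GOP-style score for one token over its aligned frames.

    window: (frames, classes) array of log-probabilities.
    Returns 0.0 when the expected token is the best non-blank class at its
    peak frame, and a negative value otherwise.
    """
    peak = int(np.argmax(window[:, tok_id]))  # frame where the token is most likely
    frame = window[peak].copy()
    frame[blank_id] = -np.inf                 # blank is not a competing pronunciation
    return float(window[peak, tok_id] - frame.max())

# Toy example: 3 tokens (IDs 0-2) plus blank (ID 3), two frames.
probs = np.array([[0.10, 0.20, 0.10, 0.60],
                  [0.20, 0.70, 0.05, 0.05]])
window = np.log(probs)
print(token_gop(window, tok_id=1, blank_id=3))            # 0.0
print(round(token_gop(window, tok_id=0, blank_id=3), 3))  # -1.253
```

With the pipeline's threshold of -2.0, the second token (score about -1.25) would not yet trigger a warning.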
Simplified Usage
Basic ASR
import numpy as np
import onnxruntime as ort
import sentencepiece as spm
import librosa
session = ort.InferenceSession("onnx/model_fp16.onnx")
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
wav, _ = librosa.load("clip.wav", sr=16000, mono=True)
# Extract features with log_mel_extract from the Full Pipeline section above
# (see tajweed/aligner.py for the exact implementation)
features = log_mel_extract(wav)[None, ...].astype(np.float32)
logprobs = session.run(["logprobs"], {
    "audio_signal": features,
    "length": np.array([features.shape[2]], dtype=np.int64),
})[0][0]  # (T_out, 1025)
# Greedy decode: collapse repeats, drop blanks (ID 1024)
ids = []
prev = 1024
for t in range(logprobs.shape[0]):
    best = int(logprobs[t].argmax())
    if best != 1024 and best != prev:
        ids.append(best)
    prev = best
print(sp.decode_ids(ids))
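If rough timestamps are useful alongside the transcript, the greedy decode can also record the frame at which each token first appears. This is a minimal sketch on synthetic log-probs, assuming 80 ms per output frame as stated earlier; the function name `greedy_decode_with_times` is hypothetical.

```python
import numpy as np

OUTPUT_HOP_S = 0.080  # 80 ms per output frame

def greedy_decode_with_times(logprobs, blank_id=1024, hop_s=OUTPUT_HOP_S):
    """Return [(token_id, onset_seconds), ...] from (T_out, classes) log-probs."""
    best = logprobs.argmax(axis=1)
    out, prev = [], blank_id
    for t, tok in enumerate(best):
        tok = int(tok)
        if tok != blank_id and tok != prev:
            out.append((tok, round(t * hop_s, 3)))
        prev = tok
    return out

# Synthetic example: frames whose argmax is [blank, 5, 5, blank, 9].
toy = np.full((5, 1025), -10.0)
for t, winner in enumerate([1024, 5, 5, 1024, 9]):
    toy[t, winner] = 0.0
print(greedy_decode_with_times(toy))  # [(5, 0.08), (9, 0.32)]
```

These onsets come from the argmax path only; the Viterbi forced alignment in the full pipeline gives tighter per-token intervals.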
Using the tajweed aligner
from tajweed.aligner import CTCAligner
aligner = CTCAligner(
    model_path="onnx/model_fp16.onnx",
    tokenizer_path="tokenizer.model",
)
logprobs = aligner.transcribe(audio_waveform)
tokens = aligner.decode(logprobs)
print(tokens)
Web (dev mode)
import { Tokenizer } from 'https://cdn.jsdelivr.net/npm/@huggingface/tokenizers@0.1.3/+esm';
// Load tokenizer (8.3 kB gzip, pure JS, no WASM)
const tokRes = await fetch('https://huggingface.co/Saboorhsn/quran-stt-onnx/resolve/main/tokenizer.json');
const tokenizer = new Tokenizer(await tokRes.json(), {
  unk_token: '<unk>', bos_token: '<s>', eos_token: '</s>'
});
const encoded = tokenizer.encode('بِسْمِ اللَّهِ');
const decoded = tokenizer.decode(encoded.ids);
// Load ONNX model
const response = await fetch(
  'https://huggingface.co/Saboorhsn/quran-stt-onnx/resolve/main/onnx/model_fp16.onnx'
);
const session = await (await import('onnxruntime-web'))
  .InferenceSession.create(await response.arrayBuffer(), {
    executionProviders: ['wasm']
  });
For production Android, bundle model_int8.onnx in the APK.
Performance
| Metric | Value |
|---|---|
| WER (loose, no diacritics) | 2.9 % |
| CER (strict, with diacritics) | 2.7 % |
| WER (strict, with diacritics) | 17.5 % |
| WER (zero-shot, 30 unseen qaris) | 23.0 % |
| RTF (Python, CPU) | 0.04 |
| RTF (WASM SIMD, x86) | 0.22 |
| RTF (Android native) | ~0.04 |
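To illustrate how the loose and strict rows differ, the sketch below computes WER with and without harakat, plus CER and the real-time factor (RTF, processing time divided by audio duration). It assumes "loose" means stripping U+064B to U+0652 before comparison, as the pipeline's harakat pattern does; the exact normalization behind the reported numbers is not stated here, so treat this as an approximation.

```python
import re

HARAKAT = re.compile(r"[\u064B-\u0652]")  # assumption: the diacritics ignored by "loose"

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]

def wer(ref, hyp, strip_diacritics=False):
    if strip_diacritics:
        ref, hyp = HARAKAT.sub("", ref), HARAKAT.sub("", hyp)
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / max(len(ref_words), 1)

def cer(ref, hyp):
    return edit_distance(list(ref), list(hyp)) / max(len(ref), 1)

def rtf(processing_s, audio_s):
    """Real-time factor: values below 1 mean faster than real time."""
    return processing_s / audio_s

ref = "\u0642\u064f\u0644\u0652 \u0647\u064f\u0648\u064e"  # "qul huwa", sukun on the lam
hyp = "\u0642\u064f\u0644\u064e \u0647\u064f\u0648\u064e"  # same letters, fatha instead of sukun
print(wer(ref, hyp), wer(ref, hyp, strip_diacritics=True))  # 0.5 0.0
```

A single wrong diacritic counts as a full word error in strict WER but vanishes in loose WER, which is consistent with the large gap between the 17.5 % and 2.9 % rows. An RTF of 0.04 means a 10 s clip is processed in roughly 0.4 s.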
Credits
- Original model — Muno459/fastconformer-quran
- Base architecture — NVIDIA FastConformer via NeMo
- Pre-trained weights — nvidia/stt_en_fastconformer_hybrid_large_pc
- Datasets — EveryAyah · tlog
- ONNX export & quantization — Saboor Hsn · Mualim.app
Related
- Mualim-Quran — Quran tutoring app using this pipeline
- hadith-api-toon — CDN-optimized Hadith API (68K hadiths, 8 languages)
License
Apache 2.0