FastConformer-Quran — Streaming (CoreML / Apple Neural Engine)
Real-time, on-device streaming Quranic recitation ASR for iOS & macOS. Cache-aware FastConformer-Hybrid (CTC), fp16, runs on the Apple Neural Engine at a few milliseconds per chunk — built for live recitation tracking (word highlighting, follow-along, real-time feedback).
This is the streaming member of the FastConformer-Quran family. For maximum-accuracy full-utterance
transcription see the offline CoreML repo;
for the source model / ONNX / .nemo, see Muno459/fastconformer-quran.
- Riwayah: Hafs only — not a general Arabic ASR.
- Output: Arabic with full tashkīl (diacritics).
- Architecture: cache-aware FastConformer-Hybrid, CTC head, att_context_size = [70, 13] (~1.04 s lookahead), fixed chunk of 112 mel frames (1120 ms).
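The timing figures in the architecture bullet follow from simple arithmetic. The sketch below reproduces them, assuming the standard FastConformer 8x encoder subsampling (not stated in this card) so that one encoder frame spans 80 ms. All constant names are illustrative.

```swift
// Timing implied by the streaming configuration.
let melHopMs = 10              // hop between mel frames: 10 ms
let chunkMelFrames = 112       // fixed chunk size from the spec
let subsampling = 8            // ASSUMPTION: standard FastConformer 8x subsampling
let rightContextFrames = 13    // att_context_size[1], counted in encoder frames

let chunkMs = chunkMelFrames * melHopMs                 // audio covered by one chunk
let encoderFrameMs = melHopMs * subsampling             // audio covered by one encoder frame
let lookaheadMs = rightContextFrames * encoderFrameMs   // right-context lookahead
let chunkSamples = chunkMs * 16                         // 16 samples per ms at 16 kHz

print("chunk: \(chunkMs) ms (\(chunkSamples) samples), lookahead: \(lookaheadMs) ms")
```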
✅ Verified on the Apple Neural Engine
Measured on-device (Apple Silicon, MLComputeUnits.cpuAndNeuralEngine):
| Metric | Result |
|---|---|
| Decoding | 11 / 11 Al-Fātiḥah + Al-Ikhlās ayāt correct (incl. the Basmala), 0 NaN |
| ANE residency | ~99% — 1094 ops on ANE / 9 on CPU (no silent GPU/CPU fallback) |
| Latency | 5–8 ms per 1120 ms chunk (real-time, large margin) |
The chunked, limited-context attention bounds fp16 accumulation by design, so CTC margins stay safely positive in fp16 on the ANE.
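To put the latency row in perspective, the real-time factor (processing time divided by audio duration) can be computed directly from the table. This is a minimal sketch. The function name is illustrative and the figures are the ones reported above.

```swift
// Real-time factor: processing time / audio duration (lower is faster).
func realTimeFactor(processingMs: Double, audioMs: Double) -> Double {
    return processingMs / audioMs
}

let chunkAudioMs = 1120.0
let best = realTimeFactor(processingMs: 5, audioMs: chunkAudioMs)
let worst = realTimeFactor(processingMs: 8, audioMs: chunkAudioMs)
print("RTF between \(best) and \(worst)")
```

Both values are below 0.01, so inference takes under 1% of each chunk's duration. The rest of the budget remains for feature extraction, decoding and UI work.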
Scope: the on-device check above is on 11 clean EveryAyah test ayāt — strong evidence the design holds. A broad multi-reciter WER sweep on-device is future work; a few frames sit on a thin positive margin (~0.01–0.46 nats), so a very noisy input could surface an edge case.
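One way to monitor thin margins at runtime is to compute, per frame, the gap between the best and second-best log-probability. The sketch below assumes that this is what "margin" refers to. The function name and any threshold you apply to its result are hypothetical, not part of the model.

```swift
// Per-frame margin in nats: best log-probability minus the runner-up.
// ASSUMPTION: "margin" means the top-1 vs top-2 gap of the CTC log-probs.
func frameMargin(_ frame: [Float]) -> Float {
    precondition(frame.count >= 2, "need at least two classes")
    var best = -Float.infinity
    var second = -Float.infinity
    for value in frame {
        if value > best {
            second = best
            best = value
        } else if value > second {
            second = value
        }
    }
    return best - second
}

// Toy frame with three classes; real frames have 1025 entries.
let margin = frameMargin([-0.1, -2.5, -4.0])
print("margin:", margin)
```

Frames whose margin falls below a chosen threshold could be logged or treated as low-confidence by the app.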
Accuracy (held-out WER / CER %)
Evaluated on a leakage-free held-out set (EveryAyah reciters never used in training + a held-out QUL reciter + real phone-recorded recitation) with greedy CTC decoding. Alef-insensitive WER is given in parentheses:
| Test set | Streaming WER | CER |
|---|---|---|
| EveryAyah (held-out reciters, clean studio) | 6.3 (6.0) | 2.2 |
| QUL — Al-Nufais (held-out reciter, clean) | 11.6 (11.2) | 6.7 |
| Real phone recitation (tlog) | 19.6 (14.3) | 7.1 |
| All | 9.8 (8.6) | 4.0 |
Streaming trades some accuracy for low-latency, state-carrying inference. If you need the lowest WER and latency isn't critical, the offline variant scores ~3% WER on the same clips.
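For readers reproducing these numbers, the sketch below shows one plausible way to compute plain and alef-insensitive WER. It assumes that "alef-insensitive" means folding the hamza, madda and wasla alef forms to bare alef before scoring. The exact normalization used for the table is not specified here, and both function names are illustrative.

```swift
// ASSUMPTION: alef-insensitive scoring folds U+0623, U+0625, U+0622 and U+0671
// to bare alef (U+0627). Works on Unicode scalars so attached diacritics survive.
func normalizeAlef(_ text: String) -> String {
    let variants: Set<Unicode.Scalar> = ["\u{0623}", "\u{0625}", "\u{0622}", "\u{0671}"]
    let bareAlef: Unicode.Scalar = "\u{0627}"
    var out = String.UnicodeScalarView()
    for scalar in text.unicodeScalars {
        out.append(variants.contains(scalar) ? bareAlef : scalar)
    }
    return String(out)
}

// Word error rate: word-level edit distance divided by reference length.
func wordErrorRate(reference: String, hypothesis: String) -> Double {
    let ref = reference.split(separator: " ").map(String.init)
    let hyp = hypothesis.split(separator: " ").map(String.init)
    guard !ref.isEmpty else { return hyp.isEmpty ? 0 : 1 }
    var prev = Array(0...hyp.count)            // distances for the empty reference prefix
    for i in 1...ref.count {
        var cur = [i] + [Int](repeating: 0, count: hyp.count)
        for j in 0..<hyp.count {
            let substitution = prev[j] + (ref[i - 1] == hyp[j] ? 0 : 1)
            cur[j + 1] = min(prev[j + 1] + 1, cur[j] + 1, substitution)
        }
        prev = cur
    }
    return Double(prev[hyp.count]) / Double(ref.count)
}

let ref = "قل هو الله أحد"   // last word starts with alef-hamza (U+0623)
let hyp = "قل هو الله احد"   // last word starts with bare alef (U+0627)
print(wordErrorRate(reference: ref, hypothesis: hyp))
print(wordErrorRate(reference: normalizeAlef(ref), hypothesis: normalizeAlef(hyp)))
```

In this toy pair the only difference is the alef form of the last word, so the error disappears once both sides are normalized.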
Files
| File | Purpose | Size |
|---|---|---|
| fastconformer-quran-streaming.mlpackage | Cache-aware streaming encoder + CTC head | ~204 MB |
| pronunciation-head.mlpackage | Per-token pronunciation scorer (streaming-matched) | ~5 MB |
| tokenizer.model / tokens.txt | SentencePiece BPE (1024 pieces + blank id 1024) | - |
Streaming model I/O (fixed shapes, fp16)
Inputs
| Name | Shape | dtype |
|---|---|---|
| audio_signal | (1, 80, 112) | float16 |
| cache_last_channel | (1, 17, 70, 512) | float16 |
| cache_last_time | (1, 17, 512, 8) | float16 |
| cache_last_channel_len | (1,) | int32 |
Outputs
| Name | Shape |
|---|---|
| logprobs | (1, 13, 1025) |
| encoder_output | (1, 512, 13) |
| cache_last_channel_next | (1, 17, 70, 512) |
| cache_last_time_next | (1, 17, 512, 8) |
| cache_last_channel_len_next | (1,) |
All shapes are concrete (no dynamic axes, no audio-length input), so the Neural Engine pre-compiles one
kernel and runs it without fallback. Feed each chunk and carry the three *_next cache tensors into the
next call.
Quick start (Swift)
```swift
import CoreML

// zero(_:), computeLogMel(_:), ctcCollapse(_:) and sentencePieceDecode(_:) are
// app-provided helpers, not Core ML APIs.

let cfg = MLModelConfiguration()
cfg.computeUnits = .cpuAndNeuralEngine
let model = try FastConformerQuranStreaming(configuration: cfg)

// Empty caches: shapes must match the spec exactly.
var cacheLC = try MLMultiArray(shape: [1, 17, 70, 512], dataType: .float16) // attention cache
var cacheLT = try MLMultiArray(shape: [1, 17, 512, 8], dataType: .float16)  // conv cache
var cacheLen = try MLMultiArray(shape: [1], dataType: .int32)
cacheLen[0] = 0
zero(cacheLC); zero(cacheLT)

// Fixed chunk: 112 mel frames = 1120 ms = 17,920 samples @ 16 kHz.
let CHUNK_SAMPLES = 112 * 160
var buffer = [Float](), transcript = ""

func feed(_ samples: [Float]) throws {
    buffer.append(contentsOf: samples)
    while buffer.count >= CHUNK_SAMPLES {
        let chunk = Array(buffer.prefix(CHUNK_SAMPLES))
        buffer.removeFirst(CHUNK_SAMPLES)
        let feats = computeLogMel(chunk)             // (1, 80, 112) Float16
        let out = try model.prediction(audio_signal: feats,
                                       cache_last_channel: cacheLC,
                                       cache_last_time: cacheLT,
                                       cache_last_channel_len: cacheLen)
        // Carry the updated state into the next call.
        cacheLC = out.cache_last_channel_next
        cacheLT = out.cache_last_time_next
        cacheLen = out.cache_last_channel_len_next
        transcript += sentencePieceDecode(ctcCollapse(out.logprobs))
    }
}
```
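The loop above only consumes full chunks, so any audio left in the buffer when recitation stops is never transcribed. One possible way to flush it is to zero-pad the tail to a full chunk. This is a sketch assuming that trailing silence is acceptable to the model, which this card does not specify. padToChunk is a hypothetical helper.

```swift
// Zero-pad a trailing partial buffer to exactly one chunk.
// ASSUMPTION: padding with silence is acceptable for the final chunk.
// Returns nil when there is nothing to flush or the buffer already holds a full chunk.
func padToChunk(_ tail: [Float], chunkSamples: Int) -> [Float]? {
    guard !tail.isEmpty, tail.count < chunkSamples else { return nil }
    return tail + [Float](repeating: 0, count: chunkSamples - tail.count)
}

// Example: flush 3 leftover samples into a toy 8-sample chunk.
if let last = padToChunk([0.1, 0.2, 0.3], chunkSamples: 8) {
    print(last.count)
}
```

Because the features are mean/variance normalized, a long run of padded zeros may shift the statistics of that final chunk, so it is worth checking the last few words of the output when using this approach.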
Feature extraction (must match exactly)
80-channel log-mel, identical to NeMo FilterbankFeatures:
- 16 kHz, mono
- window 25 ms (400 samples), Hann · hop 10 ms (160 samples) · 512-pt FFT
- 80 mel bins (Slaney), power spectrum, log(mel + 1e-5)
- pre-emphasis 0.97 (applied to the waveform, before the STFT)
- per-feature mean/var normalization of the log-mel output
Python reference: tajweed/aligner.py. A Swift port is roughly 200 lines, using Accelerate for the FFT.
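Two of the steps listed above are easy to get subtly wrong, so here is a minimal plain-Swift sketch of pre-emphasis and per-feature normalization. It assumes the NeMo convention as I understand it: the first sample is passed through unchanged, and the standard deviation uses the unbiased (n - 1) estimate with a small epsilon added. It also assumes the statistics are taken over the frames passed in. Check all of this against tajweed/aligner.py before relying on it. Function names are illustrative.

```swift
// Pre-emphasis on the raw waveform: y[0] = x[0], y[t] = x[t] - a * x[t-1].
func preEmphasize(_ x: [Float], coefficient: Float = 0.97) -> [Float] {
    guard let first = x.first else { return [] }
    var y = [Float](repeating: 0, count: x.count)
    y[0] = first
    for t in 1..<x.count {
        y[t] = x[t] - coefficient * x[t - 1]
    }
    return y
}

// Per-feature normalization. features[m][t] is log-mel bin m at frame t.
// ASSUMPTION: unbiased std plus epsilon, computed over the frames given.
func normalizePerFeature(_ features: [[Float]], epsilon: Float = 1e-5) -> [[Float]] {
    return features.map { row -> [Float] in
        guard row.count > 1 else { return row }
        let n = Float(row.count)
        let mean = row.reduce(0, +) / n
        let squaredDeviation = row.reduce(Float(0)) { $0 + ($1 - mean) * ($1 - mean) }
        let std = (squaredDeviation / (n - 1)).squareRoot() + epsilon
        return row.map { ($0 - mean) / std }
    }
}

let emphasized = preEmphasize([1, 1, 1])
let normalized = normalizePerFeature([[1, 2, 3]])
print(emphasized, normalized)
```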
Decoding
- Argmax logprobs per frame → token IDs.
- CTC collapse: merge consecutive identical IDs, then drop blanks (id 1024).
- SentencePiece-decode (tokenizer.model) → Arabic text. Append across chunks for a rolling transcript.
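The first two decoding steps can be sketched in plain Swift as follows. The decoder keeps the previous frame's ID between calls, on the assumption that a token whose frames straddle a chunk boundary should be emitted only once. CTCGreedyDecoder is a hypothetical type, and logprobs is taken as one array of 1025 scores per frame.

```swift
// Greedy CTC decoding with state carried across chunks.
struct CTCGreedyDecoder {
    let blankID: Int
    private var previousID: Int? = nil   // argmax of the last frame seen so far

    init(blankID: Int = 1024) {
        self.blankID = blankID
    }

    // Returns the token IDs newly emitted by this chunk.
    mutating func decode(_ logprobs: [[Float]]) -> [Int] {
        var tokens: [Int] = []
        for frame in logprobs {
            // Argmax over the vocabulary axis.
            guard let id = frame.indices.max(by: { frame[$0] < frame[$1] }) else { continue }
            // Emit only when the ID changes and is not the blank.
            if id != previousID && id != blankID {
                tokens.append(id)
            }
            previousID = id
        }
        return tokens
    }
}
```

SentencePiece usually strips the leading word-boundary marker when decoding, so decoding each chunk's IDs in isolation may drop spaces at chunk starts. Accumulating the IDs and decoding the whole sequence for display is a safer default.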
Pronunciation head (optional)
pronunciation-head.mlpackage is trained on features pooled from this streaming encoder, so its
input distribution matches what the model emits on-device. Inputs: pooled encoder_output per token
(512-d) + token ID. Output: prob_correct, the probability that the token was pronounced correctly.
All ops are ANE-friendly, and the output is sigmoid-bounded, so fp16 range is not a concern.
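The card does not say how encoder_output is pooled per token. As a sketch only, the code below assumes mean pooling over the encoder frames assigned to a token, for example the frames whose greedy argmax is that token. Both the pooling method and the frame assignment are assumptions, and meanPool is a hypothetical helper. Match the training pipeline before using it.

```swift
// Mean-pool encoder channels over the frames belonging to one token.
// encoderOutput[c][t]: channel c (0..<512) at frame t (0..<13).
// ASSUMPTION: mean pooling; the actual pooling used in training may differ.
func meanPool(_ encoderOutput: [[Float]], frames: [Int]) -> [Float] {
    precondition(!frames.isEmpty, "a token needs at least one frame")
    return encoderOutput.map { channel -> Float in
        let sum = frames.reduce(Float(0)) { $0 + channel[$1] }
        return sum / Float(frames.count)
    }
}

// Toy example: 2 channels, 3 frames, token spans frames 0 and 2.
let pooled = meanPool([[1, 2, 3], [4, 6, 8]], frames: [0, 2])
print(pooled)
```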
Precision note
fp16 throughout (no int8/int4 — the ANE is natively fp16). The RelPositionalEncoding xscale multiply
(×√512) can exceed the fp16 max on large activations, so it is computed then saturated to ±65 504
((x·√512).clamp(±65504)) — exactly the ANE's own behaviour, so it's a no-op on-device yet prevents
inf→NaN if an op is ever evicted off-ANE. Baked into the graph as a single clip op.
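The saturation described above can be illustrated numerically. The sketch below does the arithmetic in Float32 to show where the clamp engages. It is not the graph op itself, and the names are illustrative.

```swift
// xscale multiply followed by saturation to the finite float16 range.
let fp16Max: Float = 65504                  // largest finite float16 value
let xscale: Float = Float(512).squareRoot() // sqrt(d_model), d_model = 512

func scaledAndSaturated(_ x: Float) -> Float {
    return min(max(x * xscale, -fp16Max), fp16Max)
}

print(scaledAndSaturated(1))      // ordinary activation: unchanged by the clamp
print(scaledAndSaturated(5000))   // 5000 * sqrt(512) exceeds 65504: saturates
print(scaledAndSaturated(-5000))  // saturates at the negative bound
```

With these constants the clamp only engages for activations above about 2895 in magnitude (65504 / √512).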
License
Apache 2.0 — same as NVIDIA FastConformer-Hybrid and the upstream FastConformer-Quran model.
Citation
@misc{fastconformer_quran_coreml_streaming_2026,
title = {FastConformer-Quran (Streaming, CoreML): on-device Quranic ASR for Apple Neural Engine},
year = {2026},
url = {https://huggingface.co/Muno459/fastconformer-quran-coreml-streaming}
}
Benchmark
Leakage-free held-out WER vs nvidia / whisper / seamless / mms / omniASR / Tarteel: Quranic ASR Leaderboard.
Base model: Muno459/fastconformer-quran-streaming