URL: https://huggingface.co/Muno459/fastconformer-quran-coreml-streaming

FastConformer-Quran — Streaming (CoreML / Apple Neural Engine)

Real-time, on-device streaming Quranic recitation ASR for iOS & macOS. Cache-aware FastConformer-Hybrid (CTC), fp16, runs on the Apple Neural Engine at a few milliseconds per chunk — built for live recitation tracking (word highlighting, follow-along, real-time feedback).

This is the streaming member of the FastConformer-Quran family. For maximum-accuracy full-utterance transcription see the offline CoreML repo; for the source model / ONNX / .nemo, see Muno459/fastconformer-quran.

  • Riwayah: Hafs only — not a general Arabic ASR.
  • Output: Arabic with full tashkīl (diacritics).
  • Architecture: cache-aware FastConformer-Hybrid, CTC head, att_context_size = [70, 13] (~1.04 s lookahead), fixed chunk of 112 mel frames (1120 ms).
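The timing figures in the architecture bullet can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes the usual FastConformer 8× encoder subsampling (not stated in this card) on top of the 10 ms mel hop given under "Feature extraction"; all constant names are illustrative.

```swift
// Assumption: 8x encoder subsampling (typical for FastConformer, not stated
// in this card). The 10 ms mel hop comes from the feature-extraction section.
let melHopMs = 10
let subsampling = 8
let encoderFrameMs = melHopMs * subsampling              // 80 ms per encoder frame

let leftContextFrames = 70                               // att_context_size[0]
let rightContextFrames = 13                              // att_context_size[1]
let chunkMelFrames = 112

let lookaheadMs = rightContextFrames * encoderFrameMs    // 1040 ms, i.e. ~1.04 s
let leftContextMs = leftContextFrames * encoderFrameMs   // 5600 ms of history
let chunkMs = chunkMelFrames * melHopMs                  // 1120 ms
let chunkSamples = chunkMs * 16                          // 17920 samples at 16 kHz

print("lookahead \(lookaheadMs) ms, left context \(leftContextMs) ms")
print("chunk \(chunkMs) ms = \(chunkSamples) samples")
```

Under that assumption the attention cache covers about 5.6 s of past audio, while each chunk adds 1.12 s of new audio.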

✅ Verified on the Apple Neural Engine

Measured on-device (Apple Silicon, MLComputeUnits.cpuAndNeuralEngine):

| Metric | Result |
| --- | --- |
| Decoding | 11 / 11 Al-Fātiḥah + Al-Ikhlās ayāt correct (incl. the Basmala), 0 NaN |
| ANE residency | ~99%: 1094 ops on ANE / 9 on CPU (no silent GPU/CPU fallback) |
| Latency | 5–8 ms of compute per 1120 ms chunk (roughly 140 to 224 times faster than real time) |

The chunked, limited-context attention bounds fp16 accumulation by design, so CTC margins stay safely positive in fp16 on the ANE.

Scope: the on-device check above covers 11 clean EveryAyah test ayāt. That is good evidence the design holds, but a broad multi-reciter on-device WER sweep is future work. A few frames sit on a thin positive margin (~0.01–0.46 nats), so very noisy input could surface an edge case.

Accuracy (held-out WER / CER %)

Evaluated on a leakage-free held-out set (EveryAyah reciters never used in training + a held-out QUL reciter + real phone-recorded recitation) with greedy CTC decoding. Alef-insensitive WER is shown in parentheses:

| Test set | Streaming WER % (alef-insensitive) | CER % |
| --- | --- | --- |
| EveryAyah (held-out reciters, clean studio) | 6.3 (6.0) | 2.2 |
| QUL, Al-Nufais (held-out reciter, clean) | 11.6 (11.2) | 6.7 |
| Real phone recitation (tlog) | 19.6 (14.3) | 7.1 |
| All | 9.8 (8.6) | 4.0 |

Streaming trades some accuracy for low-latency, state-carrying inference. If you need the lowest WER and latency isn't critical, the offline variant scores ~3% WER on the same clips.
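For readers who want to reproduce numbers of this kind, the following is a hedged sketch of word error rate with an optional alef-insensitive comparison. The card does not say which alef variants were folded, so the set in `foldAlef` is an assumption, and both function names are illustrative rather than part of this repository.

```swift
import Foundation

/// Fold common alef variants to bare alef (U+0627).
/// Assumption: the card does not list the variants it folds; this set
/// (madda, hamza above, hamza below, wasla) is a guess.
func foldAlef(_ text: String) -> String {
    let variants: Set<Unicode.Scalar> = ["\u{0622}", "\u{0623}", "\u{0625}", "\u{0671}"]
    let bareAlef: Unicode.Scalar = "\u{0627}"
    var scalars = String.UnicodeScalarView()
    for scalar in text.unicodeScalars {
        scalars.append(variants.contains(scalar) ? bareAlef : scalar)
    }
    return String(scalars)
}

/// Word error rate: word-level Levenshtein distance / reference word count.
func wordErrorRate(reference: String, hypothesis: String) -> Double {
    let ref = reference.split(separator: " ").map(String.init)
    let hyp = hypothesis.split(separator: " ").map(String.init)
    guard !ref.isEmpty else { return hyp.isEmpty ? 0 : 1 }
    var previous = Array(0...hyp.count)          // distances for the empty reference prefix
    for i in 1...ref.count {
        var current = [i] + Array(repeating: 0, count: hyp.count)
        for j in 0..<hyp.count {
            let substitution = previous[j] + (ref[i - 1] == hyp[j] ? 0 : 1)
            current[j + 1] = min(substitution, previous[j + 1] + 1, current[j] + 1)
        }
        previous = current
    }
    return Double(previous[hyp.count]) / Double(ref.count)
}

// The last word differs only in its alef form (U+0623 vs U+0627).
let reference = "قل هو الله \u{0623}حد"
let hypothesis = "قل هو الله \u{0627}حد"
let strictWER = wordErrorRate(reference: reference, hypothesis: hypothesis)
let foldedWER = wordErrorRate(reference: foldAlef(reference), hypothesis: foldAlef(hypothesis))
print(strictWER, foldedWER)
```

The folding works on Unicode scalars rather than `Character` values because an alef followed by tashkīl forms a single grapheme cluster in Swift.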


Files

| File | Purpose | Size |
| --- | --- | --- |
| fastconformer-quran-streaming.mlpackage | Cache-aware streaming encoder + CTC head | ~204 MB |
| pronunciation-head.mlpackage | Per-token pronunciation scorer (streaming-matched) | ~5 MB |
| tokenizer.model / tokens.txt | SentencePiece BPE (1024 pieces + blank id 1024) | |

Streaming model I/O (fixed shapes, fp16)

Inputs

| Name | Shape | dtype |
| --- | --- | --- |
| audio_signal | (1, 80, 112) | float16 |
| cache_last_channel | (1, 17, 70, 512) | float16 |
| cache_last_time | (1, 17, 512, 8) | float16 |
| cache_last_channel_len | (1,) | int32 |

Outputs

| Name | Shape |
| --- | --- |
| logprobs | (1, 13, 1025) |
| encoder_output | (1, 512, 13) |
| cache_last_channel_next | (1, 17, 70, 512) |
| cache_last_time_next | (1, 17, 512, 8) |
| cache_last_channel_len_next | (1,) |

All shapes are concrete (no dynamic axes, no length input), so the Neural Engine pre-compiles one kernel and runs it without fallback. Feed each chunk, carry the three *_next cache tensors into the next call.
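Because the cache shapes are fixed, the state carried between calls has a fixed size per stream. The sketch below simply multiplies out the shapes from the tables above; the `byteCount` helper is illustrative and not part of the model's API.

```swift
/// Bytes needed for a dense tensor with the given shape and element size.
func byteCount(shape: [Int], bytesPerElement: Int) -> Int {
    shape.reduce(1, *) * bytesPerElement
}

let attentionCacheBytes = byteCount(shape: [1, 17, 70, 512], bytesPerElement: 2) // float16
let convCacheBytes = byteCount(shape: [1, 17, 512, 8], bytesPerElement: 2)       // float16
let cacheLenBytes = byteCount(shape: [1], bytesPerElement: 4)                    // int32
let totalStateBytes = attentionCacheBytes + convCacheBytes + cacheLenBytes

print("attention cache: \(attentionCacheBytes) B")   // 1218560
print("conv cache:      \(convCacheBytes) B")        // 139264
print("total state:     \(totalStateBytes) B")       // 1357828
```

That is roughly 1.36 MB of state per stream, which is what has to be reset to zeros when a new recitation starts.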


Quick start (Swift)

import CoreML

let cfg = MLModelConfiguration()
cfg.computeUnits = .cpuAndNeuralEngine
// Xcode-generated model class; the actual name follows your .mlpackage file name.
let model = try FastConformerQuranStreaming(configuration: cfg)

// Empty caches: shapes must match the spec exactly.
// zero(_:) is a helper you supply; a freshly allocated MLMultiArray is uninitialized.
var cacheLC = try MLMultiArray(shape: [1, 17, 70, 512], dataType: .float16) // attention cache
var cacheLT = try MLMultiArray(shape: [1, 17, 512, 8], dataType: .float16)  // conv cache
var cacheLen = try MLMultiArray(shape: [1], dataType: .int32); cacheLen[0] = 0
zero(cacheLC); zero(cacheLT)

// Fixed chunk: 112 mel frames = 1120 ms = 17,920 samples @ 16 kHz.
let CHUNK_SAMPLES = 112 * 160
var buffer = [Float](), transcript = ""

// computeLogMel, ctcCollapse and sentencePieceDecode are helpers you supply
// (see "Feature extraction" and "Decoding" below).
func feed(_ samples: [Float]) throws {
    buffer.append(contentsOf: samples)
    // Run the model once per complete chunk; leftover samples wait in the buffer.
    while buffer.count >= CHUNK_SAMPLES {
        let chunk = Array(buffer.prefix(CHUNK_SAMPLES))
        buffer.removeFirst(CHUNK_SAMPLES)
        let feats = computeLogMel(chunk) // (1, 80, 112) Float16
        let out = try model.prediction(audio_signal: feats,
                                       cache_last_channel: cacheLC,
                                       cache_last_time: cacheLT,
                                       cache_last_channel_len: cacheLen)
        // Carry the updated caches into the next call.
        cacheLC = out.cache_last_channel_next
        cacheLT = out.cache_last_time_next
        cacheLen = out.cache_last_channel_len_next
        transcript += sentencePieceDecode(ctcCollapse(out.logprobs))
    }
}

Feature extraction (must match exactly)

80-channel log-mel, identical to NeMo FilterbankFeatures:

  • 16 kHz, mono
  • window 25 ms (400 samples), Hann · hop 10 ms (160 samples) · 512-pt FFT
  • 80 mel bins (Slaney), power spectrum, log(mel + 1e-5)
  • pre-emphasis 0.97 on the waveform (before the STFT); per-feature mean/variance normalization on the log-mel output

Python reference: tajweed/aligner.py. A Swift port is about 200 lines, using Accelerate for the FFT.
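The two steps that bracket the STFT and mel filterbank are small enough to sketch in plain Swift. This is not the repository's implementation: the function names are illustrative, the unbiased variance and the epsilon added to the standard deviation are assumptions about NeMo's per-feature normalization, and the card does not say over which window (chunk or longer) the statistics are computed. The STFT and mel projection are omitted.

```swift
import Foundation

/// Pre-emphasis on the raw waveform: y[0] = x[0], y[n] = x[n] - a * x[n-1].
func preEmphasize(_ samples: [Float], coefficient: Float = 0.97) -> [Float] {
    guard !samples.isEmpty else { return [] }
    var out = samples
    for n in 1..<samples.count {
        out[n] = samples[n] - coefficient * samples[n - 1]
    }
    return out
}

/// Per-feature normalization: each mel bin (one row, laid out over time)
/// is shifted to zero mean and scaled to unit variance.
/// Assumptions: unbiased variance (N - 1) and epsilon added to the std.
func normalizePerFeature(_ features: [[Float]], epsilon: Float = 1e-5) -> [[Float]] {
    features.map { (row: [Float]) -> [Float] in
        guard row.count > 1 else { return row }
        let mean = row.reduce(Float(0), +) / Float(row.count)
        let squared = row.reduce(Float(0)) { $0 + ($1 - mean) * ($1 - mean) }
        let std = (squared / Float(row.count - 1)).squareRoot() + epsilon
        return row.map { ($0 - mean) / std }
    }
}
```

Whatever port you write, comparing its output against the Python reference on the same audio is the safest way to confirm the features match.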

Decoding

  1. Argmax logprobs per frame → token IDs.
  2. CTC collapse: drop blanks (id 1024) and dedupe consecutive identical IDs.
  3. SentencePiece-decode (tokenizer.model) → Arabic text. Append across chunks for a rolling transcript.
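Steps 1 and 2 can be sketched as a small stateful decoder. The type below is illustrative, not part of this repository; it takes log-probabilities already copied into a `[frames][vocab]` Swift array, and it remembers the last frame's ID so that a token straddling a chunk boundary is not emitted twice, which is a design choice of this sketch rather than something the card specifies.

```swift
/// Greedy CTC decoder for chunked input (hypothetical helper type).
struct StreamingCTCDecoder {
    let blankID: Int
    private var previousID: Int

    init(blankID: Int = 1024) {
        self.blankID = blankID
        self.previousID = blankID   // start as if the previous frame was blank
    }

    /// `logProbs` is [frames][vocab]; returns the token IDs newly emitted
    /// by this chunk, after dropping blanks and consecutive repeats.
    mutating func decode(_ logProbs: [[Float]]) -> [Int] {
        var emitted: [Int] = []
        for frame in logProbs {
            // Step 1: argmax over the vocabulary for this frame.
            guard let best = frame.indices.max(by: { frame[$0] < frame[$1] }) else { continue }
            // Step 2: CTC collapse, carrying state across chunk boundaries.
            if best != blankID && best != previousID {
                emitted.append(best)
            }
            previousID = best
        }
        return emitted
    }
}
```

For step 3, it may be safer to accumulate IDs across chunks and decode the whole list, since decoding each chunk's IDs separately can drop the word-boundary marker at the start of a chunk.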

Pronunciation head (optional)

pronunciation-head.mlpackage is trained on features pooled from this streaming encoder, so its input distribution matches what the model emits on-device. Inputs: the pooled encoder_output for each token (512-d) plus the token ID. Output: prob_correct, the probability that the token was pronounced correctly. All ops are ANE-friendly, and the output is sigmoid-bounded, so fp16 range is not a concern.
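The card does not say how encoder frames are pooled per token. As a purely illustrative sketch, the function below assumes mean pooling over the frames assigned to a token, with encoder_output copied into a channel-major `[512][13]` Swift array; the function name and the frame-to-token assignment are assumptions.

```swift
/// Mean-pool channel-major encoder output ([channels][frames]) over the
/// frames assigned to one token.
/// Assumption: mean pooling; the card only says "pooled".
func meanPool(_ encoderOutput: [[Float]], frames: Range<Int>) -> [Float] {
    precondition(!frames.isEmpty, "frame range must not be empty")
    return encoderOutput.map { channel in
        frames.reduce(Float(0)) { $0 + channel[$1] } / Float(frames.count)
    }
}
```

With a real chunk this yields one 512-d vector per token, which would be passed to the head together with the token ID.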

Precision note

fp16 throughout (no int8/int4; the ANE is natively fp16). The RelPositionalEncoding xscale multiply (×√512) can exceed the fp16 maximum on large activations, so the product is saturated to ±65,504, i.e. (x·√512).clamp(±65504). This matches the ANE's own saturating behaviour, so it is a no-op on-device, yet it prevents inf→NaN if an op is ever evicted off the ANE. The clamp is baked into the graph as a single clip op.
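The saturation can be illustrated numerically. This sketch does the arithmetic in 32-bit Float purely to show where the fp16 limit is reached; it is not the graph op itself, and the names are illustrative.

```swift
import Foundation

let fp16Max: Float = 65504                 // largest finite float16 value
let xscale = Float(512).squareRoot()       // about 22.627

/// (x * sqrt(512)) clamped to the finite fp16 range.
func scaledAndSaturated(_ x: Float) -> Float {
    min(max(x * xscale, -fp16Max), fp16Max)
}

// Activations beyond this magnitude would overflow fp16 after scaling.
let overflowThreshold = fp16Max / xscale   // about 2895

print(scaledAndSaturated(1))       // small inputs pass through scaled
print(scaledAndSaturated(5000))    // saturates at 65504
print(overflowThreshold)
```

Only activations with magnitude above roughly 2,895 are affected, so the clamp leaves typical values unchanged.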

License

Apache 2.0 — same as NVIDIA FastConformer-Hybrid and the upstream FastConformer-Quran model.

Citation

@misc{fastconformer_quran_coreml_streaming_2026,
 title = {FastConformer-Quran (Streaming, CoreML): on-device Quranic ASR for Apple Neural Engine},
 year = {2026},
 url = {https://huggingface.co/Muno459/fastconformer-quran-coreml-streaming}
}

Benchmark

Leakage-free held-out WER vs nvidia / whisper / seamless / mms / omniASR / Tarteel: Quranic ASR Leaderboard.
