bol-tts-marathi-onnx — ONNX export

ONNX-format export of the Marathi Kokoro-82M fine-tune at shreyask/bol-tts-marathi. Designed for WebGPU / transformers.js / onnxruntime deployments.

Live demo: shreyask/bol-tts-marathi (in-browser via WebGPU using this very ONNX file)
Write-up: kshreyas.dev/post/bol-tts-marathi
Code + export script: github.com/shreyaskarnik/bol-tts-marathi

Architecture: Kokoro-82M with disable_complex=True, which swaps TorchSTFT for CustomSTFT because TorchSTFT relies on complex tensors that ONNX doesn't support.

Files

onnx/model.onnx — fp32 model, 326 MB
config.json — Kokoro inference config with ɭ at slot 144 (Marathi retroflex lateral)
voice_speeds.json — per-voice optimal default speed
voices/*.pt — 25 voicepack .pt files, [510, 1, 256] float32 each

Model I/O

Inputs:
 input_ids: int64 [1, n_phonemes] — phoneme token IDs (per config.json vocab).
 MUST be wrapped with BOS=0 and EOS=0:
 [0, *content_ids, 0]
 style: float32 [1, 256] — voicepack slice at position [content_n_phonemes].
 (Naming follows kokoro-js + thewh1teagle/kokoro-onnx
 ecosystem convention.)
 speed: float32 [1] — pacing multiplier (1.0 = neutral; <1.0 slows down, >1.0 speeds up).
 Divides the predictor's per-phoneme duration BEFORE
 rounding, so it scales actual frame allocation —
 not just playback rate.

Outputs:
 audio: float32 [1, n_samples] — 24 kHz waveform. Includes BOS+EOS audio at start/end —
 strip `bos_frames * 600` samples from the front and
 `eos_frames * 600` from the back if you want
 content-only audio (Rasa-trained voicepacks generate
 a soft breathy pre-roll for BOS that surfaces as
 "umm" if not stripped).
 pred_dur: int64 [1, n_phonemes] — per-phoneme durations in predictor frames.
 1 frame = 600 audio samples at 24 kHz.
 pred_dur[0] = BOS duration; pred_dur[-1] = EOS.
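The effect of speed on frame allocation can be illustrated with a small sketch. It assumes the predictor yields fractional per-token durations that are divided by speed and then rounded, with a floor of one frame per token as in upstream Kokoro; allocate_frames and the sample durations are hypothetical and not part of this repo.

```python
def allocate_frames(raw_durations, speed=1.0):
    """Divide raw per-token durations by speed, then round to whole frames."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    # Assumption: every token keeps at least one frame (upstream Kokoro clamps to min=1).
    return [max(1, round(d / speed)) for d in raw_durations]


raw = [3.4, 7.6, 2.5, 12.0]  # made-up predictor outputs, in frames
normal = allocate_frames(raw, speed=1.0)
fast = allocate_frames(raw, speed=2.0)

# Each frame is 600 samples at 24 kHz, so total frames map directly to audio length.
print(sum(normal) * 600 / 24_000, "s at speed 1.0")
print(sum(fast) * 600 / 24_000, "s at speed 2.0")
```

Because rounding happens after the division, doubling speed does not halve every duration exactly, which is why the output length is not a simple resampling of the neutral-speed audio.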

pred_dur is exposed so downstream apps can build phoneme/word-level timestamps.
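A minimal sketch of turning pred_dur into per-phoneme timestamps, assuming the frame size stated above (600 samples at 24 kHz, so 25 ms per frame). The helper name phoneme_spans and its stripped flag are illustrative, not part of this repo.

```python
SAMPLE_RATE = 24_000
HOP = 600  # audio samples per predictor frame


def phoneme_spans(frames, stripped=True):
    """Return (start_s, end_s) for each content phoneme.

    frames: flat list of ints [BOS, p1, ..., pn, EOS],
            e.g. pred_dur.flatten().tolist().
    stripped: True if the BOS audio was already cut from the waveform,
              so the first content phoneme starts at 0 s.
    """
    frame_s = HOP / SAMPLE_RATE  # seconds per frame
    cursor = 0 if stripped else frames[0]
    spans = []
    for dur in frames[1:-1]:  # skip the BOS and EOS entries
        spans.append((cursor * frame_s, (cursor + dur) * frame_s))
        cursor += dur
    return spans


spans = phoneme_spans([4, 2, 3, 5])
```

Word-level timestamps follow by grouping consecutive phoneme spans that belong to the same word and taking the first start and last end.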

Usage — onnxruntime (Python)

import numpy as np
import onnxruntime as ort
import torch
import soundfile as sf
import json
from misaki import espeak

sess = ort.InferenceSession("onnx/model.onnx", providers=["CPUExecutionProvider"])
# Read as UTF-8 explicitly: the vocab keys are IPA characters
with open("config.json", encoding="utf-8") as f:
    vocab = json.load(f)["vocab"]
voice = torch.load("voices/mf_asha.pt", map_location="cpu", weights_only=True)

g2p = espeak.EspeakG2P(language="mr")
text = "नमस्कार, मी मराठी बोलतो."
phonemes, _ = g2p(text)
# Phonemes missing from the vocab are silently dropped
content_ids = [vocab[p] for p in phonemes if p in vocab]

# Wrap with BOS=0, EOS=0
input_ids = np.array([[0, *content_ids, 0]], dtype=np.int64)
# Voicepack indexed by CONTENT length (not wrapped length): [510, 1, 256] -> slot
style = voice[len(content_ids)].numpy().astype(np.float32)
speed = np.array([1.0], dtype=np.float32)

audio, pred_dur = sess.run(None, {
    "input_ids": input_ids,
    "style": style,
    "speed": speed,
})

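# Strip BOS+EOS audio (optional but recommended; see I/O notes above)
HOP = 600  # audio samples per predictor frame at 24 kHz
audio = audio.reshape(-1)  # [1, n_samples] -> [n_samples], so len() counts samples
bos_frames = int(pred_dur.flatten()[0])
eos_frames = int(pred_dur.flatten()[-1])
audio = audio[bos_frames * HOP : len(audio) - eos_frames * HOP]

sf.write("out.wav", audio, 24000)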

Usage — WebGPU / transformers.js

The live demo at shreyask/bol-tts-marathi uses this exact ONNX file via @huggingface/transformers. The TS client calls await model({ input_ids, style, speed }) and applies the BOS/EOS strip + per-utterance silence injection at punctuation boundaries client-side. Source: Space's src/model.ts.
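The client-side silence injection can be approximated in Python as follows. This is a sketch, not a port of the Space's src/model.ts: the pause lengths in PAUSE_MS, the punctuation set, and the synthesize callback (any function that maps a text segment to already-stripped 24 kHz audio) are all assumptions made for illustration.

```python
import re

import numpy as np

SAMPLE_RATE = 24_000
# Hypothetical pause lengths in milliseconds; the demo's actual values are not stated here.
PAUSE_MS = {",": 150, ";": 200, ".": 300, "?": 300, "!": 300, "।": 300}


def split_at_punctuation(text):
    """Return (segment, trailing_punctuation) pairs, skipping empty segments."""
    pieces = re.split(r"([,;.?!।])", text)
    pairs = []
    for i in range(0, len(pieces), 2):
        segment = pieces[i].strip()
        punct = pieces[i + 1] if i + 1 < len(pieces) else ""
        if segment:
            pairs.append((segment, punct))
    return pairs


def join_with_pauses(text, synthesize):
    """Synthesize each segment and insert silence after its punctuation mark."""
    chunks = []
    for segment, punct in split_at_punctuation(text):
        chunks.append(np.asarray(synthesize(segment), dtype=np.float32).reshape(-1))
        n_silence = PAUSE_MS.get(punct, 0) * SAMPLE_RATE // 1000
        if n_silence:
            chunks.append(np.zeros(n_silence, dtype=np.float32))
    return np.concatenate(chunks) if chunks else np.zeros(0, dtype=np.float32)
```

In practice synthesize would wrap the onnxruntime call plus the BOS/EOS strip from the Python usage section; keeping it as a callback leaves the pause logic independent of the model.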

For Marathi support in upstream Kokoro-JS pipelines, you'll need to monkey-patch 'm' as a Marathi lang_code (espeak 'mr').

Voicepacks (25)

This repo ships all 25 voicepacks deployed in the live demo as .pt files (use them as style input):

4 trained on Marathi corpora: mf_asha, mm_vivek (Rasa), mf_mukta, mm_dnyanesh (SPRINGLab)
19 stock-Kokoro crossovers: af_heart (Svara), af_nova (Tara), am_liam (Atharv), bf_emma-style (Ira), hm_omega (Vihaan), zf_xiaoxiao (Pari, kid), zf_xiaoyi (Vir, kid), … etc. See the demo's voicepacks.json for the full ID → display-name mapping.
2 synthetic: syn_sama (centroid mean of 5 voicepacks), syn_navya (centroid + Gaussian noise) — generated arithmetically with no reference audio.
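The arithmetic behind the two synthetic voicepacks can be sketched with NumPy. This only illustrates "centroid mean" and "centroid + Gaussian noise" on arrays of the documented [510, 1, 256] shape; which five voicepacks were averaged, the noise scale sigma, and whether the noise was drawn per slot are not stated, so those choices here are assumptions.

```python
import numpy as np


def centroid_voicepack(packs):
    """Element-wise mean of several voicepacks with identical shapes."""
    stacked = np.stack([np.asarray(p, dtype=np.float32) for p in packs])
    return stacked.mean(axis=0)


def jittered_voicepack(centroid, sigma=0.01, seed=0):
    """Centroid plus zero-mean Gaussian noise (sigma is a hypothetical value)."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=centroid.shape).astype(np.float32)
    return centroid + noise


# Random stand-ins for five real voicepacks, each [510, 1, 256] float32.
rng = np.random.default_rng(42)
packs = [rng.standard_normal((510, 1, 256)).astype(np.float32) for _ in range(5)]

centroid = centroid_voicepack(packs)
jittered = jittered_voicepack(centroid)
```

With real data, each stand-in would be replaced by a loaded voicepack converted to NumPy, for example torch.load(...).numpy().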

Export details

Exported via scripts/export_onnx.py:

torch.onnx.export(
    KModelForONNX(kmodel),  # upstream wrapper, runs forward_with_tokens
    (dummy_input_ids, dummy_style, dummy_speed),
    output_path,
    input_names=["input_ids", "style", "speed"],
    output_names=["audio", "pred_dur"],
    dynamic_axes={
        "input_ids": {1: "n_phonemes"},
        "audio": {1: "n_samples"},
        "pred_dur": {1: "n_phonemes"},
    },
    opset_version=17,
    dynamo=False,  # legacy TorchScript tracer; pinned for torch ≤ 2.8
    do_constant_folding=True,
)

⚠️ torch ≤ 2.8 required for export. torch ≥ 2.9 silently emits a static-output ONNX with the legacy tracer (dynamo=False) on Kokoro's InstanceNorm-under-spectral-norm + LSTM + CustomSTFT combo. The exported file loads + runs in onnxruntime but produces silence. We pin torch==2.6 in our export venv. See bol-tts-marathi pyproject.toml for the constraint.
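Since the failure mode is silent, an export script could guard against it explicitly. The sketch below is one possible approach and not taken from scripts/export_onnx.py: it checks a version string against the ≤ 2.8 constraint and flags an all-quiet waveform. Both helper names and the 1e-4 peak threshold are assumptions.

```python
import re


def torch_supports_legacy_export(version):
    """True if a version string such as '2.6.0+cu121' is at most 2.8."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) <= (2, 8)


def looks_silent(samples, threshold=1e-4):
    """True if no sample exceeds the (hypothetical) peak threshold."""
    peak = max((abs(float(s)) for s in samples), default=0.0)
    return peak < threshold
```

A typical use would be to call torch_supports_legacy_export(torch.__version__) before exporting, then run one test sentence through onnxruntime and fail the export if looks_silent(audio.reshape(-1)) returns True.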

disable_complex=True is mandatory — Kokoro's default TorchSTFT uses complex tensors that ONNX doesn't support.

License

Apache 2.0. See the base PyTorch model for full citation/attribution.


Model tree for shreyask/bol-tts-marathi-onnx

Base model: yl4579/StyleTTS2-LJSpeech
Finetuned: hexgrad/Kokoro-82M
Finetuned: shreyask/bol-tts-marathi
Quantized: this model
URL: https://huggingface.co/shreyask/bol-tts-marathi-onnx
