URL: https://pypi.org/project/supertonic/


supertonic 1.3.1

pip install supertonic

High-quality Text-to-Speech synthesis with ONNX Runtime

Verified details

These details have been verified by PyPI
Owner
Maintainers
ato_sup

Unverified details

These details have not been verified by PyPI
Project links
Meta
  • License Expression: MIT
    SPDX License Expression
  • Author: Yu, Yechan
  • Maintainer: Supertone Inc.
  • Tags tts , text-to-speech , speech-synthesis , onnx , deep-learning , audio , voice-synthesis , neural-tts , on-device , tts-server , http-server , fastapi , openai-compatible
  • Requires: Python >=3.9
  • Provides-Extra: playback , serve , dev

Project description

Supertonic 3: Lightning Fast, On-Device TTS

Supertonic-3: Multilingual synthesis across 31 languages, plus a na fallback for text whose language is unknown or outside the supported set.

Quick Start

pip install supertonic

Python

Every parameter is annotated inline, so the snippet doubles as copy-and-paste documentation for an LLM assistant:

from supertonic import TTS

# Note: first run downloads the model (~400MB) into ~/.cache/supertonic3/
tts = TTS(auto_download=True)  # Initialize TTS engine

style = tts.get_voice_style(voice_name="M1")  # 10 built-in voices: M1–M5, F1–F5

wav, duration = tts.synthesize(
    text="Supertonic is a lightning fast, on-device TTS system.",
    voice_style=style,       # Voice style object
    total_steps=8,           # Quality: 5 (low) to 12 (high), default 8
    speed=1.05,              # Speed: 0.7 (slow) to 2.0 (fast)
    max_chunk_length=300,    # Max characters per chunk (auto: 120 for Korean)
    silence_duration=0.3,    # Silence between chunks (seconds)
    lang="en",               # ISO code; see "Supported Languages" below
    verbose=False,           # Show detailed progress (default: False)
)
tts.save_audio(wav, "output.wav")

# Multilingual: just swap `lang` and the input text
wav_ko, _ = tts.synthesize("회의는 잠시 후에 시작되며 모두가 자리에 앉아 기다립니다.", voice_style=style, lang="ko")
wav_es, _ = tts.synthesize("La reunión comienza pronto y todos se sientan en silencio para escuchar.", voice_style=style, lang="es")
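
The max_chunk_length and silence_duration parameters imply that long input is split into chunks and joined with short pauses. The sketch below illustrates one plausible way to do that, assuming greedy packing of sentences and a 44.1 kHz sample rate. Both helpers (split_into_chunks, silence_samples) are hypothetical names written for this illustration; they are not part of the supertonic package and may differ from its actual splitter.

```python
import re


def split_into_chunks(text: str, max_chunk_length: int = 300) -> list:
    """Greedily pack whole sentences into chunks of at most max_chunk_length characters.

    Hypothetical helper for illustration only; not supertonic's own splitter.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chunk_length:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single over-long sentence is kept whole in this sketch.
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def silence_samples(silence_duration: float, sample_rate: int = 44100) -> int:
    """Number of zero-valued samples needed for a pause of silence_duration seconds."""
    return round(silence_duration * sample_rate)
```

With these assumptions, a 0.3 second pause at 44.1 kHz corresponds to 13230 samples between consecutive chunks.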

Custom voices (Voice Builder)

get_voice_style() loads one of the ten built-in voices (M1–M5, F1–F5). To use a voice created in Voice Builder (zero-shot cloning from a short reference clip), pass its JSON export to get_voice_style_from_path():

# Any voice-style JSON works here:
# - a Voice Builder export, or
# - one of the bundled defaults at
# ~/.cache/supertonic3/voice_styles/{M1..M5,F1..F5}.json
# (downloaded alongside the model on first run)
# ex)
# style = tts.get_voice_style_from_path("~/.cache/supertonic3/voice_styles/M1.json")

# Load a custom voice style from a JSON file (e.g., exported from Voice Builder)
style = tts.get_voice_style_from_path("voices/my_voice.json")
wav, _ = tts.synthesize("Hello in my own cloned voice.", voice_style=style, lang="en")
tts.save_audio(wav, "output_own_voice.wav")

CLI

# Note: first run will download the model (~400MB) from HuggingFace
supertonic tts 'Supertonic is a lightning fast, on-device TTS system.' -o output.wav

# Pick a built-in voice and bump quality
supertonic tts 'Use a different voice.' -o output.wav --voice F1 --steps 10

# Use a custom voice: a Voice Builder export, or a bundled
# ~/.cache/supertonic3/voice_styles/*.json file
supertonic tts 'Hello in my own cloned voice.' -o output.wav \
  --custom-style-path voices/my_voice.json

# Multilingual support: each language with natural text handling
supertonic tts '회의는 잠시 후에 시작되며 모두가 자리에 앉아 기다립니다.' -o korean.wav --lang ko
supertonic tts 'La reunión comienza pronto y todos se sientan en silencio para escuchar.' -o spanish.wav --lang es
supertonic tts 'A reunião começa em breve e todos se sentam em silêncio para ouvir.' -o portuguese.wav --lang pt

Local Server (HTTP)

supertonic serve runs a thin local HTTP wrapper around the same engine. It makes Supertonic easy to call from environments where embedding a Python interpreter is awkward: n8n, browser extensions, Electron apps, Unity, Home Assistant, robotics devices, or anything that already speaks the OpenAI Audio Speech API.

Install and run

pip install 'supertonic[serve]'                   # adds fastapi + uvicorn
supertonic serve --host 127.0.0.1 --port 7788     # defaults; loopback only

The first run downloads the model (~400MB) just like the SDK. Once it's up:

  • Synthesis endpoint: http://127.0.0.1:7788/v1/tts
  • OpenAI-compatible alias: http://127.0.0.1:7788/v1/audio/speech
  • Interactive OpenAPI docs: http://127.0.0.1:7788/docs

--host defaults to 127.0.0.1. Binding to any other interface is opt-in and prints a one-line warning; put it behind a reverse proxy if you do.

Generate audio (two ways)

Native /v1/tts, with the full Supertonic parameter set:

curl -X POST http://127.0.0.1:7788/v1/tts \
  -H 'content-type: application/json' \
  -d '{
    "text": "Supertonic is a lightning fast, on-device TTS system.",
    "voice": "M1",
    "lang": "en",
    "steps": 8,
    "speed": 1.05,
    "response_format": "wav"
  }' \
  -o output.wav

Response is the audio bytes (audio/wav by default). Useful headers: X-Audio-Duration (seconds), X-Sample-Rate, and X-Supertonic-Version. Supported response_format values: wav, flac, ogg (Vorbis).
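
The same request can be issued from Python with only the standard library. This is a minimal sketch, assuming the server is running on the default host and port shown above; build_tts_request and synthesize_to_file are hypothetical helper names introduced here, while the JSON fields and response headers are the ones listed in this section.

```python
import json
import urllib.request


def build_tts_request(text, voice="M1", lang="en", steps=8, speed=1.05,
                      response_format="wav", base_url="http://127.0.0.1:7788"):
    """Build (but do not send) a POST request for the native /v1/tts endpoint."""
    payload = {
        "text": text,
        "voice": voice,
        "lang": lang,
        "steps": steps,
        "speed": speed,
        "response_format": response_format,
    }
    return urllib.request.Request(
        f"{base_url}/v1/tts",
        data=json.dumps(payload).encode("utf-8"),
        headers={"content-type": "application/json"},
        method="POST",
    )


def synthesize_to_file(request, out_path):
    """Send the request, write the audio bytes to out_path, return selected headers."""
    with urllib.request.urlopen(request) as response:
        audio = response.read()
        info = {
            "duration": response.headers.get("X-Audio-Duration"),
            "sample_rate": response.headers.get("X-Sample-Rate"),
            "version": response.headers.get("X-Supertonic-Version"),
        }
    with open(out_path, "wb") as handle:
        handle.write(audio)
    return info
```

Calling synthesize_to_file(build_tts_request("Hello."), "output.wav") would then write the audio and return the header values as strings, or None for any header the server did not send.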

OpenAI-compatible /v1/audio/speech, where clients that already speak the OpenAI API only need to swap the base URL:

curl -X POST http://127.0.0.1:7788/v1/audio/speech \
  -H 'content-type: application/json' \
  -d '{
    "model": "supertonic-3",
    "input": "Supertonic is a lightning fast, on-device TTS system.",
    "voice": "M1",
    "response_format": "wav"
  }' \
  -o output.wav

Multilingual works the same way: set lang to any code from Supported Languages (or na for the fallback).

Custom voices (Voice Builder JSON)

A voice JSON exported from Voice Builder (or any of the bundled ~/.cache/supertonic3/voice_styles/*.json files) can be uploaded once and then referenced by name on every subsequent request.

Import (multipart/form-data is the simplest path):

# Upload my_voice.json; the stem of the filename becomes its name.
curl -X POST http://127.0.0.1:7788/v1/styles/import \
  -F "file=@voices/my_voice.json"
# → {"name":"my_voice","stored_at":"~/.cache/supertonic3/custom_styles/my_voice.json"}

# Override the name explicitly, and allow overwriting an existing entry:
curl -X POST "http://127.0.0.1:7788/v1/styles/import?overwrite=true" \
  -F "file=@voices/my_voice.json" \
  -F "name=demo_voice"

Synthesize with the imported voice by passing its name as voice:

curl -X POST http://127.0.0.1:7788/v1/tts \
  -H 'content-type: application/json' \
  -d '{"text":"Hello in my own cloned voice.","voice":"my_voice","lang":"en"}' \
  -o output_own_voice.wav

Imported voices are persisted per model alongside the bundled voice styles, e.g. ~/.cache/supertonic3/custom_styles/<name>.json for supertonic-3 and ~/.cache/supertonic2/custom_styles/<name>.json for supertonic-2. They are re-loaded automatically on the next supertonic serve start. Names that collide with the built-ins (M1–M5, F1–F5) are rejected; existing custom names return 409 unless you pass ?overwrite=true. GET /v1/styles lists everything currently available for the loaded model.
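
The naming rules above can be checked on the client side before uploading. The following sketch is an illustration of those rules as described here, not the server's actual code: style_name and import_outcome are hypothetical helpers, and the returned strings are labels chosen for this example (the text only states that an existing custom name yields a 409).

```python
from pathlib import Path

# The ten built-in voices: M1..M5 and F1..F5.
BUILT_IN_VOICES = {f"{prefix}{index}" for prefix in "MF" for index in range(1, 6)}


def style_name(file_path, name=None):
    """Name a style after the filename stem unless an explicit name is given."""
    return name if name else Path(file_path).stem


def import_outcome(name, existing_custom, overwrite=False):
    """Predict what an import would do, following the rules described above.

    Returns one of three illustrative labels:
      "rejected": the name collides with a built-in voice
      "conflict": the name already exists and overwrite is off (HTTP 409)
      "stored":   the style would be saved
    """
    if name in BUILT_IN_VOICES:
        return "rejected"
    if name in existing_custom and not overwrite:
        return "conflict"
    return "stored"
```

For instance, uploading voices/my_voice.json a second time without ?overwrite=true maps to the "conflict" case.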

Batch synthesis

POST /v1/tts/batch accepts up to 64 items in a single request and returns each result as base64-encoded audio. Per-item voice / lang / speed can differ β€” useful for narration jobs that mix speakers or languages.

curl -X POST http://127.0.0.1:7788/v1/tts/batch \
  -H 'content-type: application/json' \
  -d '{
    "items": [
      {"text": "Supertonic is a lightning fast, on-device TTS system.", "voice": "M1", "lang": "en"},
      {"text": "회의는 잠시 후에 시작되며 모두가 자리에 앉아 기다립니다.", "voice": "F1", "lang": "ko"},
      {"text": "La reunión comienza pronto y todos se sientan en silencio para escuchar.", "voice": "F1", "lang": "es"}
    ],
    "response_format": "wav",
    "defaults": {"steps": 8, "speed": 1.05}
  }'

Response:

{
  "items": [
    {"audio_base64": "...", "duration_s": 4.32, "format": "wav", "sample_rate": 44100},
    {"audio_base64": "...", "duration_s": 4.88, "format": "wav", "sample_rate": 44100},
    {"audio_base64": "...", "duration_s": 5.36, "format": "wav", "sample_rate": 44100}
  ]
}

Each item carries fully self-contained audio bytes, so writing them out is a one-liner:

curl -fsS -X POST http://127.0.0.1:7788/v1/tts/batch \
  -H 'content-type: application/json' \
  -d '@payload.json' \
  | python3 -c '
import sys, json, base64, pathlib
for i, item in enumerate(json.load(sys.stdin)["items"]):
    pathlib.Path(f"batch_{i}.wav").write_bytes(base64.b64decode(item["audio_base64"]))
'

Items are processed sequentially (the underlying ONNX session is serialized per process), so batching is about cutting HTTP round-trips and packaging related work together, not about parallel speed-up. Any per-item failure returns a 400 with items[<index>] in the error message, and no partial audio is returned.
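
For scripted jobs it can help to build the batch payload and decode the response in Python. The sketch below assumes the request and response shapes shown in this section; build_batch_payload and decode_batch_response are hypothetical helper names, and the 64-item check simply mirrors the documented limit on the client side.

```python
import base64
import json

MAX_BATCH_ITEMS = 64  # limit stated for POST /v1/tts/batch


def build_batch_payload(items, defaults=None, response_format="wav"):
    """Assemble the JSON body for /v1/tts/batch, refusing oversized batches early."""
    items = list(items)
    if len(items) > MAX_BATCH_ITEMS:
        raise ValueError(
            f"batch has {len(items)} items; the endpoint accepts at most {MAX_BATCH_ITEMS}"
        )
    payload = {"items": items, "response_format": response_format}
    if defaults:
        payload["defaults"] = dict(defaults)
    return payload


def decode_batch_response(body):
    """Turn the JSON response (str or bytes) into a list of raw audio byte strings."""
    return [base64.b64decode(item["audio_base64"]) for item in json.loads(body)["items"]]
```

Each element returned by decode_batch_response can be written straight to a file such as batch_0.wav, matching the shell one-liner above.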

Requirements

Supertonic has minimal dependencies - just 4 core libraries:

  • onnxruntime - Fast ONNX model inference
  • numpy - Numerical operations
  • soundfile - Audio file I/O
  • huggingface-hub - Model downloads

✨ Highlights

⚡ Blazingly Fast: Low-latency, real-time synthesis across desktop, browser, mobile, and edge, fast enough to turn an entire webpage into audio in under a second

🌍 31-Language Multilingual: Synthesize directly from text across 31 languages, or pass lang="na" to let Supertonic process the text language-agnostically when you don't know the input language; no separate language adapters needed

🪶 99M-Parameter Open-Weight Model: A compact, fully open-weight checkpoint, a fraction of the size of 0.7B–2B class open TTS systems, for smaller downloads, faster cold starts, and lower memory footprint

📱 Edge-Device Ready: Runs locally on desktop, mobile, browsers, and resource-constrained hardware like Raspberry Pi or e-readers, with zero network dependency, complete privacy, and no GPU required

🔊 44.1kHz High-Quality Audio: Outputs studio-grade 44.1kHz 16-bit WAV directly, ready for production playback without any external upsampler

🎭 Expression Tags: 10 inline tags (e.g. <laugh>, <breath>, <sigh>) bring natural human nuance into generated speech without prompt engineering or reference audio

🛠️ Multi-Runtime SDKs: Ready-to-use examples through ONNX Runtime across Python, Node.js, Browser (WebGPU), Java, C++, C#, Go, Swift, iOS, Rust, and Flutter

Supported Languages

Supertonic-3 supports the following 31 ISO codes, plus a special na fallback for unknown / unsupported languages:

Code  Language    Code  Language   Code  Language    Code  Language
en    English     ko    Korean     ja    Japanese    ar    Arabic
bg    Bulgarian   cs    Czech      da    Danish      de    German
el    Greek       es    Spanish    et    Estonian    fi    Finnish
fr    French      hi    Hindi      hr    Croatian    hu    Hungarian
id    Indonesian  it    Italian    lt    Lithuanian  lv    Latvian
nl    Dutch       pl    Polish     pt    Portuguese  ro    Romanian
ru    Russian     sk    Slovak     sl    Slovenian   sv    Swedish
tr    Turkish     uk    Ukrainian  vi    Vietnamese  na    unknown / fallback

# Pick any supported code, or use 'na' for text whose language is unknown
wav, _ = tts.synthesize("Some uncommon text.", voice_style=style, lang="na")
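
When the language code comes from user input or an upstream detector, a small caller-side guard can map anything outside the table to na. This is a sketch under that assumption: SUPPORTED_LANGS is copied from the table above, and resolve_lang is a hypothetical helper, not a function provided by supertonic.

```python
# The 31 ISO codes listed in the Supported Languages table.
SUPPORTED_LANGS = frozenset({
    "en", "ko", "ja", "ar", "bg", "cs", "da", "de",
    "el", "es", "et", "fi", "fr", "hi", "hr", "hu",
    "id", "it", "lt", "lv", "nl", "pl", "pt", "ro",
    "ru", "sk", "sl", "sv", "tr", "uk", "vi",
})
FALLBACK_LANG = "na"


def resolve_lang(code):
    """Return a supported language code, or the 'na' fallback for anything else."""
    normalized = (code or "").strip().lower()
    return normalized if normalized in SUPPORTED_LANGS else FALLBACK_LANG
```

The result can be passed directly as the lang argument, for example lang=resolve_lang(detected_code), where detected_code is whatever the caller's own language detection produced.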

Performance

We evaluated Supertonic's performance (with 2 inference steps) using two key metrics across input texts of varying lengths: Short (59 chars), Mid (152 chars), and Long (266 chars).

Metrics:

  • Characters per Second: Measures throughput by dividing the number of input characters by the time required to generate audio. Higher is better.
  • Real-time Factor (RTF): Measures the time taken to synthesize audio relative to its duration. Lower is better (e.g., RTF of 0.1 means it takes 0.1 seconds to generate one second of audio).
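
Both metrics are simple ratios. The sketch below restates the two definitions as code; the function names are chosen for this illustration, and the inputs are assumed to be measured by the caller (for example with time.perf_counter around a synthesis call).

```python
def chars_per_second(num_chars: int, synthesis_seconds: float) -> float:
    """Throughput: input characters divided by the time taken to generate audio."""
    if synthesis_seconds <= 0:
        raise ValueError("synthesis_seconds must be positive")
    return num_chars / synthesis_seconds


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF: time spent synthesizing divided by the duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds
```

Under these definitions, generating one second of audio in 0.1 seconds gives an RTF of 0.1, matching the example in the bullet above.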

Characters per Second

System                        Short (59 chars)  Mid (152 chars)  Long (266 chars)
Supertonic (M4 pro - CPU)     912               1048             1263
Supertonic (M4 pro - WebGPU)  996               1801             2509
Supertonic (RTX4090)          2615              6548             12164
API ElevenLabs Flash v2.5     144               209              287
API OpenAI TTS-1              37                55               82
API Gemini 2.5 Flash TTS      12                18               24
API Supertone Sona speech 1   38                64               92
Open Kokoro                   104               107              117
Open NeuTTS Air               37                42               47

Notes:

  • API = Cloud-based API services (measured from Seoul)
  • Open = Open-source models
  • Supertonic (M4 pro - CPU) and (M4 pro - WebGPU): Tested with ONNX
  • Supertonic (RTX4090): Tested with PyTorch model
  • Kokoro: Tested on M4 Pro CPU with ONNX
  • NeuTTS Air: Tested on M4 Pro CPU with Q8-GGUF

Real-time Factor

System                        Short (59 chars)  Mid (152 chars)  Long (266 chars)
Supertonic (M4 pro - CPU)     0.015             0.013            0.012
Supertonic (M4 pro - WebGPU)  0.014             0.007            0.006
Supertonic (RTX4090)          0.005             0.002            0.001
API ElevenLabs Flash v2.5     0.133             0.077            0.057
API OpenAI TTS-1              0.471             0.302            0.201
API Gemini 2.5 Flash TTS      1.060             0.673            0.541
API Supertone Sona speech 1   0.372             0.206            0.163
Open Kokoro                   0.144             0.124            0.126
Open NeuTTS Air               0.390             0.338            0.343

Natural Text Handling

Supertonic is designed to handle complex, real-world text inputs that contain numbers, currency symbols, abbreviations, dates, and proper nouns.

🎧 View audio samples more easily: Check out our Interactive Demo for a better viewing experience of all audio examples

Overview of Test Cases:

Category              Key Challenges                                                                      Supertonic  ElevenLabs  OpenAI  Gemini  Microsoft
Financial Expression  Decimal currency, abbreviated magnitudes (M, K), currency symbols, currency codes  ✅          ❌          ❌      ❌      ❌
Time and Date         Time notation, abbreviated weekdays/months, date formats                            ✅          ❌          ❌      ❌      ❌
Phone Number          Area codes, hyphens, extensions (ext.)                                              ✅          ❌          ❌      ❌      ❌
Technical Unit        Decimal numbers with units, abbreviated technical notations                         ✅          ❌          ❌      ❌      ❌

Note: These samples demonstrate how each system handles text normalization and pronunciation of complex expressions without requiring pre-processing or phonetic annotations.

Citation

The following papers describe the core technologies used in Supertonic. If you use this system in your research or find these techniques useful, please consider citing the relevant papers:

SupertonicTTS: Main Architecture

This paper introduces the overall architecture of SupertonicTTS, including the speech autoencoder, flow-matching based text-to-latent module, and efficient design choices.

@article{kim2025supertonic,
title={SupertonicTTS: Towards Highly Efficient and Streamlined Text-to-Speech System},
author={Kim, Hyeongju and Yang, Jinhyeok and Yu, Yechan and Ji, Seunghun and Morton, Jacob and Bous, Frederik and Byun, Joon and Lee, Juheon},
journal={arXiv preprint arXiv:2503.23108},
year={2025},
url={https://arxiv.org/abs/2503.23108}
}

Length-Aware RoPE: Text-Speech Alignment

This paper presents Length-Aware Rotary Position Embedding (LARoPE), which improves text-speech alignment in cross-attention mechanisms.

@article{kim2025larope,
title={Length-Aware Rotary Position Embedding for Text-Speech Alignment},
author={Kim, Hyeongju and Lee, Juheon and Yang, Jinhyeok and Morton, Jacob},
journal={arXiv preprint arXiv:2509.11084},
year={2025},
url={https://arxiv.org/abs/2509.11084}
}

Self-Purifying Flow Matching: Training with Noisy Labels

This paper describes the self-purification technique for training flow matching models robustly with noisy or unreliable labels.

@article{kim2025spfm,
title={Training Flow Matching Models with Reliable Labels via Self-Purification},
author={Kim, Hyeongju and Yu, Yechan and Yi, June Young and Lee, Juheon},
journal={arXiv preprint arXiv:2509.19091},
year={2025},
url={https://arxiv.org/abs/2509.19091}
}

Related Projects

🏠 Main Repository: github.com/supertone-inc/supertonic

🎧 Try it live: Hugging Face Spaces

πŸ€— Model Repository: Hugging Face Models (Supertonic-3)

License

Code: MIT License

Model: OpenRAIL-M License

Copyright © 2025 Supertone Inc.

Download files

Source Distribution

supertonic-1.3.1.tar.gz (54.9 kB view details)

Uploaded Source

Built Distribution

supertonic-1.3.1-py3-none-any.whl (51.9 kB view details)

Uploaded Python 3

File details

Details for the file supertonic-1.3.1.tar.gz.

File metadata

  • Download URL: supertonic-1.3.1.tar.gz
  • Upload date:
  • Size: 54.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.11

File hashes

Hashes for supertonic-1.3.1.tar.gz
Algorithm Hash digest
SHA256 4367e8f61afea618dac948f6bee55fed4721ad66ca2d3fc90771a2a66740731e
MD5 5495e93d141ef6304c802c1b9f599a82
BLAKE2b-256 e9fe1431393433d0c0570b54b8bd1307502fa0231238ed2cd9506c3e2799a12a

File details

Details for the file supertonic-1.3.1-py3-none-any.whl.

File metadata

  • Download URL: supertonic-1.3.1-py3-none-any.whl
  • Upload date:
  • Size: 51.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.11

File hashes

Hashes for supertonic-1.3.1-py3-none-any.whl
Algorithm Hash digest
SHA256 0079c9d4166008b8a6eeae95f20c092148786b7232192dd3dd9f358960c6c077
MD5 679ff9570b42f47aee97ca42c0b37ca8
BLAKE2b-256 8f9fd3c0367115b378d09a3866502609841c283fbd48c5b8f58902a03f81b752
