Convert text to speech
curl --request POST \
  --url https://supertoneapi.com/v1/text-to-speech/{voice_id} \
  --header 'Content-Type: application/json' \
  --header 'x-sup-api-key: <api-key>' \
  --data '
{
  "text": "<string>",
  "language": "<string>",
  "style": "<string>",
  "model": "sona_speech_1",
  "output_format": "wav",
  "voice_settings": {
    "pitch_shift": 0,
    "pitch_variance": 1,
    "speed": 1,
    "duration": 0,
    "similarity": 3,
    "text_guidance": 1,
    "subharmonic_amplitude_control": 1
  },
  "include_phonemes": false,
  "normalized_text": "<string>"
}
'
Generates speech from text and returns the audio in the response body. For the conceptual walkthrough, SDK examples, and tips, see Docs: Create speech.
Endpoint
POST https://supertoneapi.com/v1/text-to-speech/{voice_id}
Path parameters
| Name | Required | Description |
|---|---|---|
| voice_id | ✅ | The ID of the target voice. |
Request body
| Name | Required | Description |
|---|---|---|
| text | ✅ | The text to convert. Max 300 characters. Use an SDK or split client-side for longer input. |
| language | ✅ | Language code (e.g. en, ko, ja). Must be supported by the voice and the model. |
| style | — | Emotional style (e.g. neutral, happy). If omitted, the voice’s default style is used. |
| model | — | TTS model. Defaults to sona_speech_1. |
| output_format | — | wav (default) or mp3. |
| voice_settings | — | Advanced voice parameters (see below). |
| include_phonemes | — | If true, response switches to JSON with base64 audio plus phoneme timing data. Default: false. |
| normalized_text | — | Pronunciation-normalized companion text (used by sona_speech_2 and sona_speech_2_flash, primarily for Japanese). |
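As a rough illustration of how the fields in the table map onto a request, the Python sketch below assembles the POST with the standard library only. The URL, header name, and field names come from this page; the function name `build_tts_request` and the client-side length check are illustrative assumptions, not part of the API or of any SDK.

```python
import json
import urllib.request

API_BASE = "https://supertoneapi.com/v1"


def build_tts_request(api_key, voice_id, text, language, **optional):
    """Build (but do not send) a POST request for the text-to-speech endpoint.

    Hypothetical helper: `optional` may carry style, model, output_format,
    voice_settings, include_phonemes, or normalized_text.
    """
    # Assumption: the 300-character limit is counted the way len() counts.
    if len(text) > 300:
        raise ValueError("text exceeds the documented 300-character limit")
    payload = {"text": text, "language": language}
    # Send optional fields only when the caller actually set them.
    payload.update({k: v for k, v in optional.items() if v is not None})
    return urllib.request.Request(
        url=f"{API_BASE}/text-to-speech/{voice_id}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-sup-api-key": api_key},
        method="POST",
    )
```

Sending the request is left to the caller, for example with `urllib.request.urlopen(req)`; building it separately keeps the payload easy to inspect before any network call is made.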
Supported languages by model
| Model | Languages |
|---|---|
| sona_speech_2, sona_speech_2_flash | en, ko, ja, bg, cs, da, el, es, et, fi, hu, it, nl, pl, pt, ro, ar, de, fr, hi, id, ru, vi |
| supertonic_api_3 | en, ko, ja, ar, bg, cs, da, de, el, es, et, fi, fr, hi, hr, hu, id, it, lt, lv, nl, pl, pt, ro, ru, sk, sl, sv, tr, uk, vi |
| supertonic_api_1 | en, ko, ja, es, pt |
| sona_speech_1 | en, ko, ja |
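A client may want to check the model and language combination before sending a request. The sketch below encodes the table as a lookup; the names `SUPPORTED_LANGUAGES`, `is_language_supported`, and `models_for_language` are hypothetical. It covers only the model side, since the voice must support the language as well.

```python
# Language codes per model, copied from the table above.
_SONA_2 = frozenset(
    "en ko ja bg cs da el es et fi hu it nl pl pt ro ar de fr hi id ru vi".split()
)

SUPPORTED_LANGUAGES = {
    "sona_speech_2": _SONA_2,
    "sona_speech_2_flash": _SONA_2,
    "supertonic_api_3": frozenset(
        "en ko ja ar bg cs da de el es et fi fr hi hr hu id it lt lv nl pl pt "
        "ro ru sk sl sv tr uk vi".split()
    ),
    "supertonic_api_1": frozenset("en ko ja es pt".split()),
    "sona_speech_1": frozenset("en ko ja".split()),
}


def is_language_supported(model, language):
    """True if the table lists `language` for `model`; unknown models give False."""
    return language in SUPPORTED_LANGUAGES.get(model, frozenset())


def models_for_language(language):
    """Return the models that list `language`, sorted by name."""
    return sorted(m for m, langs in SUPPORTED_LANGUAGES.items() if language in langs)
```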
Voice settings
Unsupported settings are silently ignored; they don’t error.
| Name | Range | Default | Description |
|---|---|---|---|
| pitch_shift | -24 → 24 | 0 | Pitch shift in semitones. |
| pitch_variance | 0 → 2 | 1 | Degree of pitch variation. |
| speed | 0.5 → 2 | 1 | Playback rate multiplier. Applied after duration. |
| duration | 0 → 60 | 0 | When non-zero, generates audio targeting this length in seconds. |
| similarity | 1 → 5 | 3 | How closely the output matches the original character voice. |
| text_guidance | 0 → 4 | 1 | How sensitively delivery adapts to the text content. |
| subharmonic_amplitude_control | 0 → 2 | 1 | Subharmonic amplitude in the generated speech. |
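The ranges above can be checked on the client before a request is sent. This is a minimal sketch that assumes both range endpoints are inclusive, which the table does not state explicitly; `VOICE_SETTING_RANGES` and `out_of_range_settings` are hypothetical names.

```python
# (low, high) per setting, copied from the table above.
VOICE_SETTING_RANGES = {
    "pitch_shift": (-24, 24),
    "pitch_variance": (0, 2),
    "speed": (0.5, 2),
    "duration": (0, 60),
    "similarity": (1, 5),
    "text_guidance": (0, 4),
    "subharmonic_amplitude_control": (0, 2),
}


def out_of_range_settings(voice_settings):
    """Return {name: (value, low, high)} for values outside the documented range.

    Assumption: both endpoints of each range are inclusive.
    """
    problems = {}
    for name, value in voice_settings.items():
        if name not in VOICE_SETTING_RANGES:
            continue  # names not in the table are not checked here
        low, high = VOICE_SETTING_RANGES[name]
        if not low <= value <= high:
            problems[name] = (value, low, high)
    return problems
```

An empty result means every recognized setting is within its documented range.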
Voice settings by model
| Setting | sona_speech_2 | sona_speech_2_flash | supertonic_api_3 | supertonic_api_1 | sona_speech_1 |
|---|---|---|---|---|---|
| pitch_shift, pitch_variance, duration | ✅ | ✅ | — | — | ✅ |
| speed | ✅ | ✅ | ✅ | ✅ | ✅ |
| similarity, text_guidance | ✅ | — | — | — | ✅ |
| subharmonic_amplitude_control | — | — | — | — | ✅ |
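Because unsupported settings are ignored without an error, it can help to surface them on the client. The sketch below encodes the support matrix and separates the settings a model honors from those it would ignore; `SETTINGS_BY_MODEL` and `split_settings` are hypothetical names, not part of the API.

```python
_PITCH_AND_DURATION = {"pitch_shift", "pitch_variance", "duration"}

# Settings each model honors, copied from the matrix above.
SETTINGS_BY_MODEL = {
    "sona_speech_2": _PITCH_AND_DURATION | {"speed", "similarity", "text_guidance"},
    "sona_speech_2_flash": _PITCH_AND_DURATION | {"speed"},
    "supertonic_api_3": {"speed"},
    "supertonic_api_1": {"speed"},
    "sona_speech_1": _PITCH_AND_DURATION
    | {"speed", "similarity", "text_guidance", "subharmonic_amplitude_control"},
}


def split_settings(model, voice_settings):
    """Return (kept, ignored): the settings `model` honors and the names it would ignore."""
    supported = SETTINGS_BY_MODEL[model]  # raises KeyError for an unknown model
    kept = {k: v for k, v in voice_settings.items() if k in supported}
    ignored = sorted(k for k in voice_settings if k not in supported)
    return kept, ignored
```

A caller could log the `ignored` list as a warning, since the server itself gives no signal for these settings.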
Response
Default (include_phonemes=false): Binary audio in the body.
- Content-Type: audio/wav or audio/mpeg (matches output_format).
- X-Audio-Length header: duration of the generated audio in seconds.
include_phonemes=true: JSON body with base64 audio plus phoneme arrays.
{
  "audio_base64": "UklGRnoGAABXQVZF...",
  "phonemes": {
    "symbols": ["", "h", "ɐ", "ɡ", "ʌ", ""],
    "start_times_seconds": [0, 0.092, 0.197, 0.255, 0.29, 0.58],
    "durations_seconds": [0.092, 0.104, 0.058, 0.034, 0.29, 0.162]
  }
}
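The JSON variant can be unpacked in a few lines. The sketch below decodes the audio and zips the three phoneme arrays into one timeline; it assumes the arrays are parallel and of equal length, as in the sample above. `parse_phoneme_response` is a hypothetical helper name.

```python
import base64
import json


def parse_phoneme_response(body):
    """Decode an include_phonemes=true response body (str or bytes).

    Returns (audio_bytes, timeline), where timeline is a list of
    (symbol, start_seconds, duration_seconds) tuples.
    """
    data = json.loads(body)
    audio = base64.b64decode(data["audio_base64"])
    phonemes = data["phonemes"]
    symbols = phonemes["symbols"]
    starts = phonemes["start_times_seconds"]
    durations = phonemes["durations_seconds"]
    # Assumption: the three arrays line up index by index.
    if not len(symbols) == len(starts) == len(durations):
        raise ValueError("phoneme arrays have different lengths")
    return audio, list(zip(symbols, starts, durations))
```

The returned audio bytes can be written directly to a file whose extension matches the requested output_format.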
Notes
- text over 300 characters returns 400. Use the Python or TypeScript SDK for automatic chunking, or split manually (see Long text).
- speed applies after duration. Setting duration=5 with speed=2 produces ~10 seconds of audio.
- When style is omitted, the first value in the voice’s styles array is used. Different voices can have different defaults; call Get voice to check.
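For the manual-splitting route, one possible approach is to cut at sentence boundaries and pack sentences into chunks of at most 300 characters. The sketch below is an illustration only and not the SDKs' chunking logic; `split_text` is a hypothetical name, and the sentence regex is a simplification that handles only `.`, `!`, and `?`.

```python
import re

MAX_CHARS = 300  # documented limit for the text field


def split_text(text, limit=MAX_CHARS):
    """Split text into chunks of at most `limit` characters, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-wrapped (may cut mid-word).
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as its own request and the resulting audio clips concatenated in order.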
See also
- Docs: Create speech. Walkthrough with SDK examples.
- Stream speech. Stream audio chunks instead of waiting for the full clip.
Authorizations
- x-sup-api-key (header)
Path Parameters
- voice_id: The ID of the target voice.
Body
application/json
- text: The text to convert to speech. Maximum string length: 300.
- language: The language code of the text. Available options: en, ko, ja, bg, cs, da, el, es, et, fi, hu, it, nl, pl, pt, ro, ar, de, fr, hi, id, ru, vi, hr, lt, lv, sk, sl, sv, tr, uk.
- style: The style of character to use for the text-to-speech conversion.
- model: The model type to use for the text-to-speech conversion. Available options: sona_speech_1, sona_speech_2, sona_speech_2_flash, supertonic_api_1, supertonic_api_3.
- output_format: The desired output format of the audio file (wav, mp3). Default is wav. Available options: wav, mp3.
- include_phonemes: Return phoneme timing data with the audio.
- normalized_text: Pre-normalized text for TTS. Only used with sona_speech_2 and sona_speech_2_flash models.
Response
Returns either binary audio or JSON with phoneme data, depending on the include_phonemes parameter.
- Binary audio file (when include_phonemes=false or omitted)
