
URL: https://deepinfra.com/models/text-to-speech

Models | Machine Learning Inference | DeepInfra




Browse DeepInfra models:

All categories and models that you can try out and use directly in DeepInfra:

featured
Qwen3-TTS

Qwen3-TTS is an advanced text-to-speech model by Alibaba's Qwen team, delivering stable, expressive, and low-latency speech generation across 10 languages.

Key capabilities:
- 9 preset voices (Vivian, Serena, Uncle_Fu, Dylan, Eric, Ryan, Aiden, Ono_Anna, Sohee), covering diverse genders, ages, and accents
- Voice cloning: clone any voice from a short (~3s) audio sample via the voice_id parameter
- Instruction control: adjust tone, emotion, and speaking style with natural language (e.g. "speak slowly and calmly", "excited tone")
- 10 languages: English, Chinese, Japanese, Korean, German, French, Russian, Spanish, Italian, Portuguese
- Streaming support: real-time PCM streaming with ~97ms first-byte latency
- Multiple output formats: WAV, MP3, FLAC, PCM

Built on a 1.7B parameter architecture using discrete multi-codebook language modeling for end-to-end speech synthesis without cascading errors. Uses a custom 12Hz acoustic tokenizer that preserves paralinguistic information and environmental audio details.
$20.00 per 1M characters
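As a rough illustration of how the Qwen3-TTS options might combine in a request, the sketch below assembles a JSON body in Python. The helper `build_tts_payload` and the field names `text`, `voice`, `instruction`, and `response_format` are hypothetical, chosen for illustration; only the preset voice names, the `voice_id` parameter, and the output formats come from the model description. Check the model's API page for the actual schema.

```python
import json

# Preset voices and output formats as listed in the model description.
PRESET_VOICES = {"Vivian", "Serena", "Uncle_Fu", "Dylan", "Eric",
                 "Ryan", "Aiden", "Ono_Anna", "Sohee"}
OUTPUT_FORMATS = {"wav", "mp3", "flac", "pcm"}


def build_tts_payload(text, voice=None, voice_id=None,
                      instruction=None, response_format="wav"):
    """Assemble a request body. Field names other than voice_id are assumed."""
    # A request uses either a preset voice or a cloned voice, not both.
    if (voice is None) == (voice_id is None):
        raise ValueError("pass exactly one of voice (preset) or voice_id (clone)")
    if voice is not None and voice not in PRESET_VOICES:
        raise ValueError(f"unknown preset voice: {voice}")
    if response_format not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported format: {response_format}")

    payload = {"text": text, "response_format": response_format}
    if voice is not None:
        payload["voice"] = voice
    else:
        payload["voice_id"] = voice_id
    if instruction:
        payload["instruction"] = instruction
    return payload


payload = build_tts_payload("Hello from Qwen3-TTS.", voice="Serena",
                            instruction="speak slowly and calmly")
print(json.dumps(payload, indent=2))
```

Validating the voice and format locally keeps an obviously malformed request from being sent (and possibly billed) in the first place.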
featured
Qwen3-TTS-VoiceDesign

Qwen3-TTS-VoiceDesign is a voice design variant of Qwen3-TTS by Alibaba's Qwen team. Instead of selecting from preset voices, you describe the voice you want in natural language, and the model generates speech in that voice.

Key capabilities:
- Natural language voice control: describe any voice with free text (e.g. "a deep male voice with a calm, authoritative presence", "a young cheerful female with a warm and friendly tone")
- 10 languages: English, Chinese, Japanese, Korean, German, French, Russian, Spanish, Italian, Portuguese
- Streaming support: real-time PCM streaming
- Multiple output formats: WAV, MP3, FLAC, PCM

Built on the same 1.7B parameter architecture as Qwen3-TTS, using discrete multi-codebook language modeling and a custom 12Hz acoustic tokenizer for high-quality end-to-end speech synthesis.
$20.00 per 1M characters
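Raw PCM has no header, so a streamed response usually needs to be wrapped before a media player will open it. The sketch below does that with Python's standard `wave` module. The sample rate, channel count, and sample width are assumptions (the listing does not state them), and the silent chunks stand in for data received from a stream.

```python
import io
import wave

# Assumed stream parameters; the listing does not state them.
SAMPLE_RATE = 24_000   # Hz (assumption)
CHANNELS = 1           # mono (assumption)
SAMPLE_WIDTH = 2       # bytes per sample, i.e. 16-bit PCM (assumption)


def pcm_chunks_to_wav(chunks):
    """Concatenate raw PCM chunks and return the bytes of a WAV file."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        for chunk in chunks:
            wav.writeframes(chunk)
    return buffer.getvalue()


# Stand-in for a streamed response: two chunks of silence, 0.5 s each.
half_second = b"\x00\x00" * (SAMPLE_RATE // 2)
wav_bytes = pcm_chunks_to_wav([half_second, half_second])

# Read the result back to confirm the header describes the audio correctly.
with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
    duration = wav.getnframes() / wav.getframerate()
print(f"{duration:.1f} s of audio, {len(wav_bytes)} bytes")
```

If the assumed parameters do not match the real stream, the file will still open but play at the wrong speed or as noise, so confirm them against the API documentation.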
chatterbox-multilingual

09/04 🔥 Introducing Chatterbox Multilingual in 23 Languages! We're excited to introduce Chatterbox and Chatterbox Multilingual, Resemble AI's production-grade open source TTS models. Chatterbox Multilingual supports Arabic, Danish, German, Greek, English, Spanish, Finnish, French, Hebrew, Hindi, Italian, Japanese, Korean, Malay, Dutch, Norwegian, Polish, Portuguese, Russian, Swedish, Swahili, Turkish, and Chinese out of the box. Licensed under MIT, Chatterbox has been benchmarked against leading closed-source systems like ElevenLabs, and is consistently preferred in side-by-side evaluations.
$1.00 per 1M characters
chatterbox-turbo

Chatterbox is a family of three state-of-the-art, open-source text-to-speech models by Resemble AI. We are excited to introduce Chatterbox-Turbo, our most efficient model yet. Built on a streamlined 350M parameter architecture, Turbo delivers high-quality speech with less compute and VRAM than our previous models. We have also distilled the speech-token-to-mel decoder, previously a bottleneck, reducing generation from 10 steps to just one, while retaining high-fidelity audio output. Paralinguistic tags are now native to the Turbo model, allowing you to use [cough], [laugh], [chuckle], and more to add distinct realism. While Turbo was built primarily for low-latency voice agents, it excels at narration and creative workflows. If you like the model but need to scale or tune it for higher accuracy, check out our competitively priced TTS service.
$1.00 per 1M characters
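Because paralinguistic tags are written inline in the input text, a quick pre-flight check can catch tags the model may not recognize. The sketch below is illustrative only: `find_tags` is a hypothetical helper, and `KNOWN_TAGS` holds just the three tags named in the Chatterbox-Turbo description, which says more exist.

```python
import re

# Tags named in the description; the model may accept more ("and more").
KNOWN_TAGS = {"cough", "laugh", "chuckle"}

# Matches lowercase bracketed tags such as [laugh].
TAG_PATTERN = re.compile(r"\[([a-z_ ]+)\]")


def find_tags(text):
    """Return (known, unknown) lists of bracketed tags found in the text."""
    tags = TAG_PATTERN.findall(text)
    known = [t for t in tags if t in KNOWN_TAGS]
    unknown = [t for t in tags if t not in KNOWN_TAGS]
    return known, unknown


script = "Well [chuckle] that went better than expected. [sigh] Anyway."
known, unknown = find_tags(script)
print("recognized:", known)
print("check before sending:", unknown)
```

An unknown tag is not necessarily invalid; it only means the tag is not one of the three listed here and is worth checking against the model's documentation.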
MiMo-V2.5-tts

Automatically convert input text into natural and fluent speech output. You can generate natural and vivid speech content by configuring parameters such as speech style and voice. Use the high-quality voices from the built-in voices list.
Partner
$0.00 per 1M characters
MiMo-V2.5-tts-voiceclone

Automatically convert input text into natural and fluent speech output. You can generate natural and vivid speech content by configuring parameters such as speech style and voice. Precisely replicate voices from audio samples to enable speech synthesis of any voice.
Partner
$0.00 per 1M characters
MiMo-V2.5-tts-voicedesign

Automatically convert input text into natural and fluent speech output. You can generate natural and vivid speech content by configuring parameters such as speech style and voice. Automatically generate voices from text descriptions, without requiring presets or audio samples.
Partner
$0.00 per 1M characters
text-to-speech
Zyphra/Zonos-v0.1-hybrid

Zonos-v0.1 is a leading open-weight text-to-speech model trained on more than 200k hours of varied multilingual speech, delivering expressiveness and quality on par with, or even surpassing, top TTS providers. Our model enables highly natural speech generation from text prompts when given a speaker embedding or audio prefix, and can accurately perform speech cloning when given a reference clip spanning just a few seconds. The conditioning setup also allows for fine control over speaking rate, pitch variation, audio quality, and emotions such as happiness, fear, sadness, and anger. The model outputs speech natively at 44kHz.
Deprecated
text-to-speech
Zyphra/Zonos-v0.1-transformer

Zonos-v0.1 is a leading open-weight text-to-speech model trained on more than 200k hours of varied multilingual speech, delivering expressiveness and quality on par with, or even surpassing, top TTS providers. Our model enables highly natural speech generation from text prompts when given a speaker embedding or audio prefix, and can accurately perform speech cloning when given a reference clip spanning just a few seconds. The conditioning setup also allows for fine control over speaking rate, pitch variation, audio quality, and emotions such as happiness, fear, sadness, and anger. The model outputs speech natively at 44kHz.
Deprecated
HiggsAudioV2.5

HiggsAudioV2.5 is a high-quality neural text-to-speech (TTS) model designed for natural-sounding voice generation across a wide range of use cases. It focuses on clarity, stable prosody, and consistent pacing, making it suitable for both short prompts and longer narration.
$20.00 per 1M characters
orpheus-3b-0.1-ft

Orpheus TTS is a state-of-the-art, Llama-based Speech-LLM designed for high-quality, empathetic text-to-speech generation. This model has been finetuned to deliver human-level speech synthesis, achieving exceptional clarity, expressiveness, and real-time streaming performance.
$7.00 per 1M characters
Kokoro-82M

Kokoro is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, Kokoro can be deployed anywhere from production environments to personal projects.
Priority
$0.62 per 1M characters
realtime-tts-1.5-max

High-quality multilingual text-to-speech model by Inworld AI with 130+ preset voices across 15 languages. Supports voice cloning, word-level timestamps, and streaming. Optimized for natural, expressive speech with <250ms time-to-first-audio.
Partner
$50.00 per 1M characters
realtime-tts-1.5-mini

Fast multilingual text-to-speech model by Inworld AI with 130+ preset voices across 15 languages. Supports voice cloning, word-level timestamps, and streaming. Optimized for low-latency applications with <130ms time-to-first-audio.
Partner
$25.00 per 1M characters
realtime-tts-2

Realtime TTS 2.0 is a low-latency text-to-speech model with natural language steering, allowing you to control tone and emotion directly in the prompt (e.g., "[be happy and upbeat] Hello!"). It supports cross-lingual voices and multiple languages, enabling the same voice to speak consistently across different languages. This is an early access preview ahead of full launch, with ongoing improvements to voice quality and steering.
Partner
$35.00 per 1M characters
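The steering format shown in the Realtime TTS 2.0 description, a bracketed direction placed before the spoken text, is easy to generate programmatically. The helper `with_steering` below is a hypothetical convenience function, not part of any SDK; it only builds the prompt string.

```python
def with_steering(text, direction=None):
    """Prefix text with a bracketed natural-language direction, if any."""
    if not direction:
        return text
    # Tolerate directions that already include surrounding brackets.
    direction = direction.strip().strip("[]")
    return f"[{direction}] {text}"


prompt = with_steering("Hello!", "be happy and upbeat")
print(prompt)  # [be happy and upbeat] Hello!
```

Keeping the direction separate from the spoken text makes it straightforward to reuse one script with different tones.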
text-to-speech
sesame/csm-1b

CSM (Conversational Speech Model) is a speech generation model from Sesame that generates RVQ audio codes from text and audio inputs. The model architecture employs a Llama backbone and a smaller audio decoder that produces Mimi audio codes.
$7.00 per 1M characters
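Since every model here is priced per 1M characters, the cost of a job is the character count times the listed price, divided by one million. The sketch below applies that to a few of the listed prices. It assumes billed characters equal Python's `len(text)`; the listing does not say how characters are counted, so treat the result as an estimate.

```python
# USD per 1M characters, copied from the listing.
PRICE_PER_MILLION = {
    "Qwen3-TTS": 20.00,
    "chatterbox-turbo": 1.00,
    "orpheus-3b-0.1-ft": 7.00,
    "Kokoro-82M": 0.62,
    "realtime-tts-1.5-max": 50.00,
    "realtime-tts-1.5-mini": 25.00,
    "realtime-tts-2": 35.00,
    "csm-1b": 7.00,
}


def estimate_cost(model, num_chars):
    """Estimated USD cost for synthesizing num_chars characters."""
    return PRICE_PER_MILLION[model] * num_chars / 1_000_000


# Example: a 250,000-character script, cheapest model first.
num_chars = 250_000
for model in sorted(PRICE_PER_MILLION, key=PRICE_PER_MILLION.get):
    print(f"{model:24s} ${estimate_cost(model, num_chars):.3f}")
```

For a real script, pass `len(script_text)` as `num_chars`.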

© 2026 DeepInfra. All rights reserved.