Three things happened in quick succession. Kokoro-82M went viral because it matched or beat much larger TTS models while running on consumer hardware. Fish Speech 1.5 ranked first among open-source models on TTS-Arena as of early 2025 (FishAudio has since shipped S1 and then S2; S2 now leads open benchmarks including EmergentTTS-Eval, Seed-TTS Eval, and the Audio Turing Test, beating closed-source systems from Google and OpenAI on the win-rate test). Hume released TADA in March 2026, a model built specifically to eliminate hallucinations in long-form synthesis. In the same period, ElevenLabs raised their API prices and PlayHT shut down after Meta acquired it in July 2025.
Developers started asking the obvious question: if open-source TTS is this good, why keep paying per character?
This guide answers that practically: which GPU you need, what a full deployment looks like, what it costs, and how to wire TTS into a production voice agent. For context on the full voice AI pipeline (ASR + LLM + TTS), see the voice AI GPU infrastructure guide and the Whisper ASR deployment guide for the speech recognition layer. For generative music alongside narration, see the AI music generation GPU deployment guide.
The 2026 Open-Source TTS Models Worth Deploying
Kokoro-82M has 82M parameters and an Apache 2.0 license. The v1.0 release (January 27, 2025) ships 54 voices across 8 languages. Model weights are under 1GB at FP16, though total GPU memory during inference (including CUDA kernels and buffers) runs 2-3GB. It hits an RTF of about 0.03 on an A100. The key advantage is footprint: you can pack many instances onto a single GPU. A community-maintained Docker image (ghcr.io/remsky/kokoro-fastapi-gpu) exposes an OpenAI-compatible API with zero configuration.
Fish Speech 1.5 has an unconfirmed parameter count (estimated ~500M, but no official figure has been published) and ranked first among open-source models on TTS-Arena as of early 2025. Note that FishAudio has since released S1 and then S2; S2 is the current flagship and now holds top positions across EmergentTTS-Eval, Seed-TTS Eval, and the Audio Turing Test. Fish Speech 1.5 is still a solid lightweight self-hostable option, covered here for its small footprint. If you want the best quality and inline emotion control, see the S2 section below. Fish Speech 1.5 supports 13 languages including Chinese, Japanese, and Korean, with emotion and style control via conditioning parameters. VRAM requirement is 12GB minimum, with 24GB recommended for production workloads. Voice cloning from reference audio is built in, no fine-tuning required. License is CC BY-NC-SA 4.0, which means non-commercial use only. Commercial use requires a separate agreement from FishAudio.
Fish Audio S2 is FishAudio's open-weights successor to S1, released in March 2026. It uses a Dual-AR (dual-autoregressive) design: a Slow AR runs along the time axis (~4B parameters) and predicts the primary semantic codebook, while a Fast AR (~400M parameters) fills in the residual codebooks at each step, for roughly 4.4B parameters total. The benchmark results are the headline. S2 posts an 81.88% win rate on EmergentTTS-Eval against a GPT-4o-mini-TTS baseline, the highest of any model evaluated, open or closed, and an Audio Turing Test score of 0.515 against 0.417 for Seed-TTS. On Seed-TTS Eval it records the lowest word error rate of all systems tested, including proprietary ones, at 0.54% in Chinese and 0.99% in English. The differentiator for deployment is free-form inline emotion control: you embed natural-language tags at specific word positions, for example [laugh], [whispers], or [whisper in small voice], rather than passing discrete conditioning parameters. It covers a wide range of languages (FishAudio lists 80+ for the model; the technical report describes training across roughly 50) with voice cloning from a short reference clip, around 15 seconds. Weights, fine-tuning code, and an SGLang-based inference engine ship together. The license is open weights for research and non-commercial use; commercial deployment needs a separate license from FishAudio. There is also a managed Fish Audio API if you would rather not run the GPU yourself, which is often the cheaper path under ~50M characters per month.
Hume TADA (Text-Acoustic Dual Alignment) was released in March 2026 by Hume AI. The headline claim is zero hallucinations on the LibriTTSR test set in long-form synthesis: the model stops and signals rather than inventing words when context is ambiguous. It also targets expressive synthesis with emotional alignment. VRAM is approximately 2.5GB for the 1B model and 9GB for the 3B model in bf16, though independent benchmarks are limited as of April 2026. Weights are available for self-hosting by research and commercial customers.
NVIDIA PersonaPlex-7B is a 7B parameter real-time speech-to-speech conversational model requiring 16GB VRAM minimum, with 24GB+ recommended for smooth real-time performance. It is designed for full-duplex conversations with simultaneous listening and speaking, not a traditional TTS pipeline. Licensed under NVIDIA Open Model License (weights) with MIT license (code). Include it if your application needs live conversational voice interaction; for batch or streaming TTS use cases, Kokoro or Fish Speech are more appropriate.
Model comparison:
| Model | Parameters | VRAM | RTF (A100) | Languages | License | Best For |
|---|---|---|---|---|---|---|
| Kokoro-82M v1.0 | 82M | ~1GB weights (2-3GB total) | ~0.03 | 8 | Apache 2.0 | High-throughput English TTS |
| Fish Speech 1.5 | unconfirmed | ~12GB min | ~0.20 | 13 | CC BY-NC-SA 4.0 | Multilingual, style control |
| Fish Audio S2 | ~4.4B (Dual-AR) | TBA | 0.195 (H200)† | 80+ | Open weights (commercial license req.) | Expressive multilingual TTS, voice cloning, inline emotion control |
| Hume TADA | 1B / 3B | ~2.5GB (1B) / ~9GB (3B) | ~0.25 est. | English (multi planned) | Commercial | Expressive voice agents |
| PersonaPlex-7B | 7B | 16GB min / 24GB+ rec. | ~0.50 | English | NVIDIA OML / MIT | Full-duplex conversational voice |
RTF figures are estimates based on model architecture and available community benchmarks. Run your own benchmarks with your audio length distribution before capacity planning.
†S2's published RTF of 0.195 is measured on an H200, not an A100, and FishAudio has not published official minimum VRAM or A100 numbers yet. Treat it as a rough indicator and benchmark on your target GPU.
GPU Requirements and Real-Time Factors
RTF (real-time factor) is generation time divided by output audio duration. An RTF above 1.0 means the model cannot keep up with real-time playback. Anything below 0.1 means the GPU is mostly idle when serving a single stream, so you can pack in more concurrent users.
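As a rough illustration of the definition above, the sketch below computes RTF from a measured generation time and the duration of a returned WAV file. The helper names (`wav_duration_seconds`, `real_time_factor`, `make_silent_wav`) are illustrative and not part of any TTS library, and the sketch assumes the audio is a complete, non-streamed PCM WAV so the header's frame count is reliable.

```python
import io
import wave


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Return the playback duration of an in-memory PCM WAV file."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF = generation time / output audio duration (lower is better)."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return generation_seconds / audio_seconds


def make_silent_wav(seconds: float, sample_rate: int = 24000) -> bytes:
    """Build a silent mono 16-bit WAV, used here only as stand-in audio."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buf.getvalue()


# Stand-in for a real response body: 10 seconds of audio
audio = make_silent_wav(10.0)
duration = wav_duration_seconds(audio)
# 0.3 s of compute for 10 s of audio corresponds to the ~0.03 RTF quoted for Kokoro on A100
print(f"RTF = {real_time_factor(0.3, duration):.3f}")
```

In a real benchmark you would wrap the HTTP request in `time.perf_counter()` calls and pass the elapsed time as `generation_seconds`, repeating over your own distribution of text lengths.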
RTF by GPU:
| GPU | Spheron Price | Kokoro-82M RTF | Fish Speech RTF | Concurrent Kokoro streams | Concurrent Fish Speech streams |
|---|---|---|---|---|---|
| L40S PCIe | $1.80/hr | ~0.08 | ~0.30 | ~30 | ~8 |
| A100 PCIe 80GB | $1.04/hr | ~0.03 | ~0.20 | ~50 | ~12 |
| H100 PCIe 80GB | $2.63/hr | ~0.02 | ~0.12 | ~80 | ~20 |
Pricing fluctuates based on GPU availability. The prices above are based on 09 Apr 2026 and may have changed. Check current GPU pricing for live rates.
Community benchmarks for Kokoro show RTF of ~0.04-0.06 on RTX 4090, which is comparable to the L40S PCIe figures above. Spheron does not currently list RTX 4090 in the GPU catalog. L40S PCIe ($1.80/hr) is the closest available alternative at a similar price point and performs comparably for inference-only workloads.
Step-by-Step: Deploy Kokoro-82M on Spheron GPU Cloud
Provision Your Instance
- Go to app.spheron.ai
- Select A100 PCIe 80GB ($1.04/hr): sufficient for 50+ concurrent Kokoro streams
- Choose Ubuntu 22.04 with at least 50GB storage
- SSH into the instance once it is running
Deploy with Docker
# Pull the community-maintained FastAPI image (GPU variant)
docker pull ghcr.io/remsky/kokoro-fastapi-gpu:latest
# Run with GPU access and expose the API port
docker run -d \
  --name kokoro \
  --gpus all \
  -p 8880:8880 \
  -e KOKORO_WORKERS=4 \
  ghcr.io/remsky/kokoro-fastapi-gpu:latest
Check the server is ready:
curl http://localhost:8880/health
Generate Audio
The server exposes an OpenAI-compatible /v1/audio/speech endpoint. You can point any OpenAI TTS client at it by changing the base URL:
curl http://localhost:8880/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kokoro",
    "input": "The GPU cloud is the fastest path from prototype to production.",
    "voice": "af_bella",
    "response_format": "wav"
  }' \
  --output output.wav
Available Voices
Kokoro v1.0 ships 54 voices across 8 languages. Key English voices:
| Voice ID | Style | Notes |
|---|---|---|
| af_bella | Female, warm | Default, most tested |
| af_sarah | Female, clear | Good for customer service |
| am_adam | Male, neutral | Good for narration |
| am_michael | Male, authoritative | Good for enterprise apps |
| bf_emma | Female, British | UK English accent |
| bm_george | Male, British | UK English accent |
Full voice list: curl http://localhost:8880/v1/voices
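The curl request from the Generate Audio step can also be issued from Python. This is a minimal sketch using only the standard library; `BASE_URL`, `build_speech_request`, and `synthesize_to_file` are illustrative names, and the JSON body simply mirrors the fields shown in the curl example. It assumes the Kokoro container is listening on localhost:8880.

```python
import json
import urllib.request

# Assumption: the Kokoro FastAPI container from the deployment step, on its default port
BASE_URL = "http://localhost:8880"


def build_speech_request(text: str, voice: str = "af_bella",
                         response_format: str = "wav") -> urllib.request.Request:
    """Build a POST request whose JSON body mirrors the curl example."""
    payload = {
        "model": "kokoro",
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text: str, path: str, voice: str = "af_bella") -> int:
    """Send the request and write the returned audio to disk; returns bytes written."""
    request = build_speech_request(text, voice)
    with urllib.request.urlopen(request, timeout=60) as response:
        audio = response.read()
    with open(path, "wb") as f:
        return f.write(audio)
```

With the server running, `synthesize_to_file("Hello from Python.", "output.wav", voice="am_adam")` should produce the same kind of file as the curl command. Keeping request construction separate from the network call makes the payload easy to inspect or unit test without a GPU.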
Streaming Configuration
For voice agents, enable sentence-level streaming to reduce time-to-first-audio:
docker run -d \
  --name kokoro \
  --gpus all \
  -p 8880:8880 \
  -e KOKORO_WORKERS=4 \
  -e KOKORO_STREAM=true \
  -e KOKORO_CHUNK_SIZE=50 \
  ghcr.io/remsky/kokoro-fastapi-gpu:latest
With streaming enabled, the server begins emitting audio chunks as soon as the first 50 tokens generate. For a voice agent, this means the user hears the start of the response while the GPU is still processing the tail end.
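To check whether streaming actually reduces time-to-first-audio in your setup, you can time the arrival of the first chunk. The sketch below is transport-agnostic: it measures any iterable of byte chunks, and `fake_stream` is a stand-in for a real response body. All function names here are illustrative.

```python
import time
from typing import Iterable, Iterator, Tuple


def timed_chunks(chunks: Iterable[bytes]) -> Iterator[Tuple[float, bytes]]:
    """Yield (seconds since start, chunk) for each non-empty audio chunk."""
    start = time.monotonic()
    for chunk in chunks:
        if chunk:
            yield time.monotonic() - start, chunk


def time_to_first_audio(chunks: Iterable[bytes]) -> Tuple[float, int]:
    """Consume a chunk stream; return (time to first chunk, total bytes received)."""
    first = None
    total = 0
    for elapsed, chunk in timed_chunks(chunks):
        if first is None:
            first = elapsed
        total += len(chunk)
    if first is None:
        raise ValueError("stream produced no audio")
    return first, total


def fake_stream(n_chunks: int = 3, delay: float = 0.01) -> Iterator[bytes]:
    """Stand-in for an HTTP response body that arrives in pieces."""
    for _ in range(n_chunks):
        time.sleep(delay)
        yield b"\x00" * 1024


ttfa, size = time_to_first_audio(fake_stream())
print(f"first audio after {ttfa * 1000:.0f} ms, {size} bytes total")
```

Against a live server, one option is to pass `iter(lambda: response.read(4096), b"")` for a `urllib.request.urlopen` response, which yields chunks until the body is exhausted. Compare the result with and without the streaming settings before relying on a latency figure.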
Step-by-Step: Deploy Fish Speech
Instance Requirements
Fish Speech 1.5 needs 12GB VRAM minimum per model instance, with 24GB recommended for production. An A100 PCIe 80GB can run up to 6 instances in parallel at minimum VRAM. H100 PCIe improves throughput by roughly 2x for high-concurrency serving. Start with A100 PCIe unless you are targeting sub-100ms latency at scale.
Installation
# Clone the repository
git clone https://github.com/fishaudio/fish-speech
cd fish-speech
# Install with CUDA-version-specific extras (use cu129, cu128, cu126, or cpu based on your CUDA version)
pip install -e '.[cu126]'
# Download model weights (~1.5GB)
pip install huggingface_hub
huggingface-cli download fishaudio/fish-speech-1.5 \
  --local-dir checkpoints/fish-speech-1.5
Start the Inference Server
# Start the web UI and API server on all interfaces
python tools/run_webui.py \
  --listen 0.0.0.0:7860 \
  --checkpoint-path checkpoints/fish-speech-1.5
The API is available at /api/v1/tts. For production deployments, run behind Nginx with rate limiting. Do not expose port 7860 directly; use an SSH tunnel for testing.
Generate Speech with Language and Emotion Control
import requests
response = requests.post(
    "http://localhost:7860/api/v1/tts",
    json={
        "text": "Your GPU deployment is ready.",
        "language": "en",
        "speaker": None,  # None uses default speaker
        "emotion": "neutral",  # Options: neutral, happy, sad, angry, fearful, disgusted, surprised
        "format": "wav",
        "streaming": False
    }
)
response.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(response.content)
Voice Cloning
Fish Speech clones voices from a reference audio clip. No fine-tuning required:
import requests
with open("reference.wav", "rb") as ref:
    response = requests.post(
        "http://localhost:7860/api/v1/tts",
        data={
            "text": "Cloned voice generation test.",
            "language": "en",
            "format": "wav"
        },
        files={
            "reference_audio": ref,
            "reference_text": (None, "Transcript of the reference audio clip.")
        }
    )
response.raise_for_status()
with open("cloned.wav", "wb") as f:
    f.write(response.content)
Reference audio recommendations: 5-15 seconds, clean recording, minimal background noise, consistent energy. Shorter clips work but speaker similarity drops below 85%.
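A quick pre-flight check can catch reference clips that fall outside the recommended length before they reach the model. This is a minimal sketch assuming a PCM WAV reference file. The function names are illustrative, and it checks duration only; noise level and energy consistency would need a separate audio analysis step.

```python
import io
import wave

MIN_SECONDS = 5.0   # lower bound of the recommended reference length
MAX_SECONDS = 15.0  # upper bound of the recommended reference length


def check_reference_clip(wav_bytes: bytes) -> dict:
    """Report clip duration and whether it falls in the 5-15 second window."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
        channels = wav.getnchannels()
    return {
        "duration_s": duration,
        "channels": channels,
        "duration_ok": MIN_SECONDS <= duration <= MAX_SECONDS,
    }


def silent_wav(seconds: float, sample_rate: int = 16000) -> bytes:
    """Build a silent mono 16-bit WAV, used here only to exercise the check."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buf.getvalue()


for seconds in (2.0, 8.0, 20.0):
    report = check_reference_clip(silent_wav(seconds))
    print(f"{seconds:>5.1f}s clip -> duration_ok={report['duration_ok']}")
```

For a real file, read it with `open("reference.wav", "rb").read()` and pass the bytes to `check_reference_clip` before uploading.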
License Note
Fish Speech 1.5 is licensed CC BY-NC-SA 4.0. This allows non-commercial use with attribution. For commercial applications, contact FishAudio for a commercial license before deploying to production.
Step-by-Step: Deploy Fish Audio S2
S2 is the model to reach for when quality and expressiveness matter more than raw footprint. It ships as a complete system: weights, fine-tuning code, and an SGLang-based serving stack. Because the Dual-AR architecture is structurally close to a standard autoregressive LLM, S2 inherits the usual LLM serving optimizations (continuous batching, paged KV cache, CUDA graph replay, and RadixAttention prefix caching) instead of needing custom inference infrastructure.
Before you wire up a GPU, decide whether you even need to. If you are under ~50M characters per month, FishAudio's managed Text-to-Speech API is often cheaper than running a dedicated A100 PCIe and carries zero GPU overhead, so you can skip this whole section and call S2 over HTTP. Self-host when you need higher volume, data residency, or full production independence. The steps below cover self-hosting the open weights.
Instance Requirements
FishAudio has not published official minimum VRAM or concurrent-stream benchmarks for S2 yet, so size conservatively and measure on your own hardware. The reference numbers FishAudio reports are from a single H200: a real-time factor of 0.195, time-to-first-audio around 100ms, and throughput above 3,000 acoustic tokens per second while keeping RTF under 0.5. An H100 PCIe is a sensible starting point for self-hosting; benchmark before you commit to capacity. To reproduce FishAudio's published numbers, an H200 instance is the closest match to their benchmark hardware. Check current GPU pricing for live rates.
Installation and Server Setup
S2 deploys through SGLang-Omni. Pull the weights from Hugging Face (fishaudio/s2) and follow the SGLang-Omni README for Fish Audio S2, which carries the current launch flags and hardware notes. The serving path is the same one FishAudio benchmarks against, so the published latency figures are a reasonable guide to what you will see in production.
# Clone the inference engine and the model repo
git clone https://github.com/sgl-project/sglang-omni
git clone https://github.com/fishaudio/fish-speech
# Pull S2 weights from Hugging Face
pip install huggingface_hub
huggingface-cli download fishaudio/s2 --local-dir checkpoints/s2
Follow the SGLang-Omni README for the exact launch command and serving flags, as those track the latest release.
Inline Emotion Control
This is the feature that sets S2 apart in a deployment. Instead of a fixed emotion parameter, you embed natural-language tags inline at the word or phrase where you want them to take effect:
Welcome back. [laugh] I did not expect to see you here.
[whisper in small voice] Keep this between us for now.
Tags are free-form, so [professional broadcast tone] or [pitch up] work alongside the common ones like [laugh] and [whispers]. This makes S2 a strong fit for dialogue, audiobooks, and character voices where emotion shifts within a single line.
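Because tags are plain text inside the prompt, it can help to lint them before sending a script to the model. The sketch below is a hypothetical pre-processing helper, not part of FishAudio's tooling: it lists bracketed tags with their character offsets, produces a tag-free version of the text (useful for subtitles or logging), and flags stray brackets.

```python
import re
from typing import List, Tuple

# Matches one well-formed [tag] with no nested brackets
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def extract_tags(text: str) -> List[Tuple[int, str]]:
    """Return (character offset, tag text) for each bracketed tag."""
    return [(m.start(), m.group(1).strip()) for m in TAG_PATTERN.finditer(text)]


def strip_tags(text: str) -> str:
    """Remove tags and collapse the whitespace they leave behind."""
    return re.sub(r"\s+", " ", TAG_PATTERN.sub("", text)).strip()


def has_unbalanced_brackets(text: str) -> bool:
    """True if a '[' or ']' remains once well-formed tags are removed."""
    leftover = TAG_PATTERN.sub("", text)
    return "[" in leftover or "]" in leftover


line = "Welcome back. [laugh] I did not expect to see you here."
print(extract_tags(line))
print(strip_tags(line))
print(has_unbalanced_brackets("[whisper Keep this between us."))
```

A check like `has_unbalanced_brackets` is mainly a guard against a forgotten closing bracket, which would otherwise be read aloud or silently misinterpreted.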
Voice Cloning
Voice cloning uses a short reference clip, around 15 seconds. The reference tokens go in the system prompt, and SGLang's RadixAttention caches those KV states automatically. FishAudio reports an average prefix-cache hit rate of 86.4% (over 90% at peak) when the same voice is reused across requests, which makes repeated-voice prefill overhead close to free. In practice, pin one voice per worker so the cache stays warm.
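One way to pin a voice to a worker is to route by a stable hash of the voice ID, so repeated requests for the same voice land on the same prefix cache. This is a sketch under assumptions: the worker URLs are hypothetical, and `worker_for_voice` is an illustrative helper rather than an SGLang feature.

```python
import hashlib
from typing import List


def worker_for_voice(voice_id: str, workers: List[str]) -> str:
    """Map a voice ID to one worker deterministically so its cache stays warm."""
    if not workers:
        raise ValueError("at least one worker is required")
    # hashlib is used instead of hash() because hash() is salted per process
    digest = hashlib.sha256(voice_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(workers)
    return workers[index]


# Hypothetical worker endpoints, for illustration only
WORKERS = ["http://s2-worker-0:8000", "http://s2-worker-1:8000", "http://s2-worker-2:8000"]

for voice in ("narrator-en", "support-agent-ja", "narrator-en"):
    print(voice, "->", worker_for_voice(voice, WORKERS))
```

Simple modulo hashing remaps most voices whenever the worker count changes, which temporarily cools every cache. If you resize the pool often, consistent hashing is the usual alternative.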
License Note
S2 is open weights for research and non-commercial use. Commercial deployment requires a separate license from FishAudio, so sort that out before you put it in front of customers. If you would rather skip the GPU entirely, use FishAudio's managed Text-to-Speech API instead; check their pricing page for current rates.
Serving Architecture: Batch vs Real-Time Streaming
Batch Processing
Batch processing fits workloads where latency is not the constraint: audiobook generation, podcast production, pre-rendered game dialogue, content localization.
Architecture:
- Request queue (Redis or SQS)
- Worker pool pulling from queue
- Output written to object storage (S3-compatible)
- Webhook notification on completion
# Worker loop for batch processing
import json
import time
import redis
import requests
MAX_RETRIES = 3
REQUIRED_FIELDS = ("text", "output_key", "callback_url")
def upload_to_storage(audio_bytes, output_key):
    # Placeholder: replace with your S3-compatible upload
    raise NotImplementedError("upload_to_storage is deployment-specific")
def notify_webhook(callback_url, output_key):
    # Placeholder: replace with your completion webhook call
    raise NotImplementedError("notify_webhook is deployment-specific")
queue = redis.Redis(host='localhost', port=6379)
connection_retries = 0
while True:
    try:
        _, job = queue.blpop('tts_queue')
        connection_retries = 0  # Reset backoff on successful connection
        try:
            job_data = json.loads(job)
            for field in REQUIRED_FIELDS:
                job_data[field]  # Raises KeyError/TypeError on a bad payload
        except (json.JSONDecodeError, KeyError, TypeError) as e:
            print(f"Malformed job payload, dead-lettering: {e!r}. Raw: {job!r}")
            queue.rpush('tts_queue_failed', job)
            continue
        try:
            response = requests.post(
                "http://localhost:8880/v1/audio/speech",
                json={
                    "model": "kokoro",
                    "input": job_data["text"],
                    "voice": job_data.get("voice", "af_bella"),
                    "response_format": "mp3"
                },
                timeout=120  # Avoid hanging the worker on a stalled request
            )
            response.raise_for_status()
            # Write to S3 or local storage
            upload_to_storage(response.content, job_data["output_key"])
            notify_webhook(job_data["callback_url"], job_data["output_key"])
        except Exception as e:
            # Track retry count to avoid re-queuing bad jobs indefinitely
            attempts = job_data.get("_attempts", 0) + 1
            if attempts < MAX_RETRIES:
                job_data["_attempts"] = attempts
                print(f"Job failed (attempt {attempts}/{MAX_RETRIES}): {e}. Re-queuing.")
                queue.rpush('tts_queue', json.dumps(job_data))
            else:
                # Move to dead-letter queue after max retries
                print(f"Job failed permanently after {MAX_RETRIES} attempts: {e}. Moving to dead-letter queue.")
                queue.rpush('tts_queue_failed', json.dumps(job_data))
    except redis.exceptions.ConnectionError as e:
        # Back off exponentially to avoid tight CPU spin on Redis outage
        wait = min(2 ** connection_retries, 60)
        print(f"Redis connection error: {e}. Retrying in {wait}s.")
        time.sleep(wait)
        connection_retries += 1
Throughput target on A100 PCIe with Kokoro: approximately 5 million characters per hour for typical short-form text, in line with the ~3.6 billion characters per month (720 hours) used in the cost analysis.
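On the producer side, jobs need the fields the worker loop reads: `text`, `output_key`, `callback_url`, and optionally `voice`. The sketch below builds and enqueues such a job. `build_tts_job` and `enqueue` are illustrative names, and the storage key and callback URL shown in the usage note are made up.

```python
import json
from typing import Optional


def build_tts_job(text: str, output_key: str, callback_url: str,
                  voice: Optional[str] = None) -> str:
    """Serialize a job with the fields the worker loop reads."""
    if not text.strip():
        raise ValueError("text must not be empty")
    job = {"text": text, "output_key": output_key, "callback_url": callback_url}
    if voice is not None:
        job["voice"] = voice  # The worker falls back to af_bella when omitted
    return json.dumps(job)


def enqueue(job_json: str, host: str = "localhost", port: int = 6379) -> None:
    """Append to the list the worker pops from (requires the redis package)."""
    import redis  # Imported lazily so build_tts_job works without Redis installed
    redis.Redis(host=host, port=port).rpush("tts_queue", job_json)
```

A producer would call something like `enqueue(build_tts_job("Chapter one.", "books/ch1.mp3", "https://example.com/hooks/tts"))`. Pushing with `rpush` while the worker pops with `blpop` keeps the queue first-in, first-out.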
Real-Time Streaming
Streaming is for interactive applications where the user hears audio before the full response is synthesized: voice agents, interactive demos, live narration.
The key technique is sentence-boundary detection. The LLM generates tokens, the application detects sentence boundaries (period, question mark, exclamation), and passes each complete sentence to TTS immediately. The user begins hearing audio ~200-400ms after the LLM generates the first sentence.
def stream_llm_to_tts(llm_stream, tts_client):
    """Pipeline LLM tokens into TTS with sentence-boundary buffering."""
    buffer = ""
    # ':' and ';' are included so long clauses also flush early
    sentence_enders = {'.', '!', '?', ':', ';'}
    for token in llm_stream:
        buffer += token
        text = buffer.strip()
        # Flush on sentence boundary; stripping first means tokens like ". " still match
        if len(text) > 10 and text[-1] in sentence_enders:
            yield tts_client.synthesize(text)
            buffer = ""
    # Flush remainder
    if buffer.strip():
        yield tts_client.synthesize(buffer.strip())
For the full voice AI pipeline combining ASR, LLM, and TTS on a single GPU, see the voice AI GPU infrastructure guide.
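The buffering logic can be exercised without a GPU or an LLM by feeding it a list of tokens and a stub TTS client. This is a self-contained sketch: `split_sentences` and `FakeTTS` are illustrative names, and the splitter here flushes only on '.', '!' and '?'.

```python
from typing import Iterable, Iterator, List

SENTENCE_ENDERS = {".", "!", "?"}


def split_sentences(tokens: Iterable[str], min_chars: int = 10) -> Iterator[str]:
    """Group streamed tokens into sentences, flushing at sentence-ending punctuation."""
    buffer = ""
    for token in tokens:
        buffer += token
        text = buffer.strip()
        # Very short fragments are held back and merged with the next sentence
        if len(text) > min_chars and text[-1] in SENTENCE_ENDERS:
            yield text
            buffer = ""
    if buffer.strip():
        yield buffer.strip()


class FakeTTS:
    """Stand-in for a TTS client; records the text it is asked to synthesize."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def synthesize(self, text: str) -> bytes:
        self.calls.append(text)
        return text.encode("utf-8")


tokens = ["Hello", " there", ",", " friend", ".", " How", " are", " you", " today", "?", " Bye"]
tts = FakeTTS()
audio = [tts.synthesize(sentence) for sentence in split_sentences(tokens)]
print(tts.calls)
```

Each element of `tts.calls` is one TTS request, so the list shows how many round trips a given response would trigger and where the cuts fall. The `min_chars` guard avoids sending tiny fragments such as "Hi." as their own request.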
Cost Analysis: Self-Hosted TTS vs ElevenLabs and PlayHT
The comparison below uses on-demand pricing as the baseline, not spot instances. This gives a conservative view of self-hosting costs. Spot instances are cheaper but interruptible, which is fine for batch workloads but not for real-time production serving.
Cost comparison (characters per month):
| Monthly Volume | ElevenLabs (Scale) | PlayHT (Pro) | Kokoro on A100 PCIe | Fish Speech on A100 PCIe |
|---|---|---|---|---|
| 1M chars | $180 | $49 | $748.80* | $748.80* |
| 10M chars | $1,800 | $490 | $748.80* | $748.80* |
| 50M chars | $9,000 | $2,450 | $748.80* | $748.80* |
| 100M chars | $18,000 | $4,900 | $748.80* | $748.80* (or 2x A100) |
*$748.80/month = $1.04/hr x 720 hours (dedicated A100 PCIe). One A100 running Kokoro handles approximately 3.6 billion characters per month at steady utilization.
Note: ElevenLabs Scale plan pricing is approximate and changes periodically. Verify current rates at elevenlabs.io before building a cost model.
Pricing fluctuates based on GPU availability. The Spheron GPU costs above are based on 09 Apr 2026 and may have changed. Check current GPU pricing for live rates.
At under 4M characters per month, API pricing is usually cheaper because you avoid the fixed GPU cost. Self-hosting becomes economical at 4-5M+ characters per month and substantially cheaper above 10M.
Fish Speech on A100 PCIe handles approximately 200-400 million characters per month at steady utilization (lower than Kokoro due to higher RTF). At 100M chars/month, one A100 still covers it; once you exceed that 200-400M range, add instances.
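The break-even point follows directly from the figures in this section: a flat monthly GPU cost against a per-character API rate. The sketch below uses the $1.04/hr A100 PCIe price and the $0.00018 per-character rate implied by the ElevenLabs column ($180 per 1M characters); both are snapshots that you should replace with current pricing.

```python
HOURS_PER_MONTH = 720  # 30 days of dedicated, always-on usage


def gpu_monthly_cost(hourly_rate: float) -> float:
    """Flat monthly cost of a dedicated GPU instance."""
    return hourly_rate * HOURS_PER_MONTH


def api_monthly_cost(chars: int, price_per_char: float) -> float:
    """Usage-based monthly cost of a per-character API."""
    return chars * price_per_char


def break_even_chars(hourly_rate: float, price_per_char: float) -> float:
    """Monthly character volume at which self-hosting and the API cost the same."""
    return gpu_monthly_cost(hourly_rate) / price_per_char


A100_HOURLY = 1.04       # USD/hr, from the pricing table
API_PER_CHAR = 0.00018   # USD/char, i.e. $180 per 1M characters

print(f"GPU: ${gpu_monthly_cost(A100_HOURLY):,.2f}/month")
print(f"Break-even: {break_even_chars(A100_HOURLY, API_PER_CHAR):,.0f} chars/month")
for volume in (1_000_000, 10_000_000, 100_000_000):
    print(f"{volume:>11,} chars -> API ${api_monthly_cost(volume, API_PER_CHAR):,.2f}")
```

This ignores engineering time, egress, and idle capacity, so treat the result as a lower bound on the volume where self-hosting pays off.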
Production Scaling: Handling 1000+ Concurrent Requests
Single GPU Capacity
| GPU | Kokoro streams | Fish Speech streams | Notes |
|---|---|---|---|
| A100 PCIe 80GB | 50 | 12 | Good starting point |
| H100 PCIe 80GB | 80 | 20 | Lower latency under load |
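For a first-pass capacity plan, divide the target concurrency by the per-GPU stream counts in the table and round up. The sketch below does that with an optional headroom fraction. The per-GPU numbers are the estimates from the table above, not measured values, and `gpus_needed` is an illustrative helper.

```python
import math

# Estimated concurrent streams per GPU, copied from the capacity table
STREAMS_PER_GPU = {
    ("A100 PCIe 80GB", "kokoro"): 50,
    ("A100 PCIe 80GB", "fish-speech"): 12,
    ("H100 PCIe 80GB", "kokoro"): 80,
    ("H100 PCIe 80GB", "fish-speech"): 20,
}


def gpus_needed(target_streams: int, gpu: str, model: str, headroom: float = 0.0) -> int:
    """Instances required for target_streams, reserving a fraction as headroom."""
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")
    usable = STREAMS_PER_GPU[(gpu, model)] * (1.0 - headroom)
    return math.ceil(target_streams / usable)


for gpu, model in STREAMS_PER_GPU:
    print(f"1000 streams of {model} on {gpu}: {gpus_needed(1000, gpu, model)} instances")
```

Passing a `headroom` such as 0.2 keeps each GPU below its estimated ceiling, which is usually worth the extra instances for latency-sensitive traffic.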
Horizontal Scaling
Beyond one GPU, run independent TTS containers on multiple instances and load balance with Nginx:
upstream kokoro_backends {
    least_conn;
    server gpu-instance-1:8880;
    server gpu-instance-2:8880;
    server gpu-instance-3:8880;
    server gpu-instance-4:8880;
}
server {
    listen 80;
    location /v1/audio/speech {
        proxy_pass http://kokoro_backends;
        proxy_read_timeout 30s;
        proxy_send_timeout 30s;
    }
    location /health {
        proxy_pass http://kokoro_backends;
    }
}
Nginx least_conn distributes requests to the backend with the fewest active connections. This performs better than round-robin for TTS because request duration varies significantly by text length.
GPU Monitoring
Monitor GPU utilization with nvidia-smi during load testing:
# Real-time GPU stats, 2-second refresh
watch -n 2 nvidia-smi
# Log utilization to CSV for capacity planning
nvidia-smi --query-gpu=timestamp,name,utilization.gpu,utilization.memory,memory.used,memory.free \
  --format=csv,noheader \
  --loop=5 >> gpu_utilization.csv
Target 80-90% GPU utilization for cost efficiency. Below 70% means the instance is over-provisioned. Above 95% means requests are queuing.
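The logged CSV can be summarized against those thresholds with a few lines of Python. This sketch assumes the column order from the query above (GPU utilization is the third column) and that values carry units such as "85 %", which is what the `csv,noheader` format emits without `nounits`. The sample rows are made up.

```python
import csv
import io
from statistics import mean
from typing import List


def parse_utilization(csv_text: str) -> List[float]:
    """Extract GPU utilization percentages from the third CSV column."""
    values = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 3:
            continue  # Skip blank or truncated lines
        values.append(float(row[2].strip().rstrip("%").strip()))
    return values


def classify(avg_utilization: float) -> str:
    """Apply the utilization thresholds from the monitoring guidance."""
    if avg_utilization < 70:
        return "over-provisioned"
    if avg_utilization > 95:
        return "saturated, requests likely queuing"
    if 80 <= avg_utilization <= 90:
        return "on target"
    return "acceptable"


# Made-up sample in the format written to gpu_utilization.csv
sample = (
    "2026/04/09 12:00:00.000, NVIDIA A100 80GB PCIe, 85 %, 40 %, 30000 MiB, 51000 MiB\n"
    "2026/04/09 12:00:05.000, NVIDIA A100 80GB PCIe, 87 %, 42 %, 30100 MiB, 50900 MiB\n"
)
average = mean(parse_utilization(sample))
print(f"average {average:.1f}% -> {classify(average)}")
```

For a real log, read the file with `open("gpu_utilization.csv").read()`. Averages hide bursts, so it is also worth checking the maximum and a high percentile over the load-test window.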
Building Voice AI Agents: Combining TTS with STT and LLM
The standard architecture is three stages on one GPU: ASR (Whisper), LLM response generation, TTS synthesis. The models stack well on a single H100 PCIe 80GB:
- H100 PCIe 80GB: Whisper Large v3 (~4GB) + a 13B LLM (~26GB FP16) + Kokoro (~2-3GB total) leaves 47GB+ headroom for KV cache
- A100 PCIe 80GB: Whisper Medium (~3GB) + a 7B LLM (~14GB FP16) + Kokoro (~2-3GB total) fits comfortably
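The headroom figures in the list above are simple subtraction, which is easy to redo for your own model mix. The sketch below uses the approximate footprints quoted in this guide, taking the upper 3GB estimate for Kokoro. The dictionary keys are just labels, and real usage will also include framework overhead that is not modeled here.

```python
from typing import Dict


def vram_headroom_gb(gpu_gb: float, components: Dict[str, float]) -> float:
    """VRAM left for KV cache after loading every pipeline stage."""
    headroom = gpu_gb - sum(components.values())
    if headroom < 0:
        raise ValueError(f"pipeline needs {-headroom:.1f}GB more than the GPU has")
    return headroom


h100_stack = {"whisper-large-v3": 4, "llm-13b-fp16": 26, "kokoro": 3}
a100_stack = {"whisper-medium": 3, "llm-7b-fp16": 14, "kokoro": 3}

print(f"H100 80GB headroom: {vram_headroom_gb(80, h100_stack)}GB")
print(f"A100 80GB headroom: {vram_headroom_gb(80, a100_stack)}GB")
```

Swapping the `kokoro` entry for a larger TTS model shows immediately how much KV cache you give up, which is the main trade-off when co-locating stages on one GPU.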
With Kokoro, TTS is the cheapest stage in the pipeline by VRAM: its ~2-3GB total GPU footprint leaves plenty of room for a larger LLM on the same instance. Fish Speech at 12GB minimum still fits alongside a 7B LLM and Whisper on an 80GB GPU, though with less headroom.
The sentence-streaming pattern keeps end-to-end latency under 500ms. As the LLM generates tokens, detect sentence boundaries and pass each complete sentence to TTS immediately. The user hears the first sentence while the LLM is still generating the rest of the response.
The voice AI GPU infrastructure guide has VRAM requirement breakdowns for each pipeline stage, latency budget analysis, and full GPU recommendations for ASR + LLM + TTS co-location. As a reference point, Fish Speech at its 12GB minimum alongside a 7B LLM and Whisper on an H100 PCIe 80GB leaves roughly 40GB for KV cache.
For NeuTTS Air users who need faster throughput or voice cloning, see the NeuTTS Air deployment guide.
For XTTS-2, F5-TTS, and OpenVoice V2, the voice cloning deployment guide covers GPU sizing, per-character cost tables, speaker embedding caching, and a complete FastAPI production setup.
Which Model Should You Deploy?
Use Kokoro-82M if: you are building an English-first voice application, you need maximum throughput on minimal hardware, or you want an OpenAI-compatible API with zero configuration.
Use Fish Speech if: your application handles multiple languages, you need style or emotion control, or you want voice cloning without fine-tuning for a non-English audience.
Use Fish Audio S2 if: you need the highest benchmark performance for expressive multilingual TTS, free-form word-level emotion control, or AI voice cloning across 80+ languages. At under ~50M characters per month, the managed Fish Audio API is often cheaper than a dedicated A100 PCIe and requires zero GPU overhead. For higher volumes or full production independence, self-host via SGLang-Omni.
Use Hume TADA if: you are building an emotionally expressive voice agent and can tolerate higher VRAM requirements and less community documentation as of April 2026.
Use PersonaPlex-7B if: you need full-duplex real-time conversational voice (simultaneous listening and speaking), not a traditional TTS pipeline. It is not suited for batch synthesis or standard streaming TTS use cases.
For the cost-sensitive production path: start with Kokoro on A100 PCIe. Add Fish Speech on the same instance if you need multilingual support. Upgrade to H100 PCIe when you outgrow the A100's estimated capacity (roughly 50 concurrent Kokoro streams or 12 Fish Speech streams).
Open-source TTS on GPU cloud is genuinely cost-effective at scale. A dedicated A100 PCIe running Kokoro costs $1.04/hr and covers 50+ concurrent streams, at a fraction of per-character API pricing above 4-5M characters per month. Spheron has A100 and H100 instances on-demand with no minimums or long-term contracts.
Quick Setup Guide
Choose the right TTS model for your use case
Compare Kokoro-82M (ultra-fast, ~1GB model weights, English-focused, 54 voices), Fish Speech (multilingual, emotion control, 12GB VRAM minimum, ranked #1 open-source on TTS-Arena as of early 2025), Fish Audio S2 (open-weights, ~4.4B Dual-AR, top EmergentTTS-Eval win rate, inline emotion control, March 2026), and Hume TADA (expressive, zero-hallucination claim, March 2026). Kokoro for throughput. Fish Speech for a small multilingual footprint. Fish Audio S2 for the highest expressive multilingual quality. Hume TADA for emotionally expressive voice agents.
Provision a GPU instance on Spheron
Go to app.spheron.ai. For Kokoro-only workloads, an A100 PCIe 80GB ($1.04/hr) handles 50+ concurrent streams. For Fish Speech or Hume TADA, A100 PCIe 80GB is sufficient; H100 PCIe ($2.63/hr) reduces latency. Select Ubuntu 22.04 with at least 50GB storage. SSH into the instance once it provisions.
Deploy Kokoro-82M with the FastAPI Docker image
Run: docker pull ghcr.io/remsky/kokoro-fastapi-gpu:latest && docker run -d --name kokoro --gpus all -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-gpu:latest. The container exposes an OpenAI-compatible /v1/audio/speech endpoint. Test with: curl http://localhost:8880/v1/audio/speech -H 'Content-Type: application/json' -d '{"model":"kokoro","input":"Hello world","voice":"af_bella"}' --output test.wav
Deploy Fish Speech from source
git clone https://github.com/fishaudio/fish-speech && cd fish-speech && pip install -e '.[cu126]' && huggingface-cli download fishaudio/fish-speech-1.5 --local-dir checkpoints/fish-speech-1.5. Start the inference server: python tools/run_webui.py --listen 0.0.0.0:7860 --checkpoint-path checkpoints/fish-speech-1.5. Use the REST API at /api/v1/tts for programmatic access with language, style, and speaker parameters.
Configure streaming for real-time voice agents
For Kokoro: set KOKORO_STREAM=true and KOKORO_CHUNK_SIZE=50 in the Docker environment. Audio chunks begin streaming before the full text processes. For Fish Speech: pass streaming=true in the API request body. Connect to your LLM with sentence-boundary detection so TTS begins as soon as the LLM emits the first complete sentence, cutting perceived latency by 40-60%.
Set up production load balancing
Run multiple Kokoro or Fish Speech containers behind an Nginx upstream block. Each A100 PCIe can host 2-4 Fish Speech workers or 8-16 Kokoro workers depending on concurrency targets. Use health checks at /health and sticky sessions if using stateful voice cloning. Monitor GPU utilization with nvidia-smi - target 80-90% for cost efficiency.
Frequently Asked Questions
What GPU do I need to run Kokoro-82M?
Kokoro-82M model weights fit under 1GB VRAM (total GPU memory during inference, including CUDA buffers, is typically 2-3GB) and it runs on any modern NVIDIA GPU. An A100 PCIe 80GB ($1.04/hr on Spheron) handles 50+ concurrent Kokoro streams at well under 0.1 real-time factor. For very high throughput (1000+ concurrent requests), scale horizontally with additional A100 or H100 PCIe instances.
How does self-hosted TTS cost compare to ElevenLabs?
ElevenLabs charges $0.00018 per character as the overage rate on Scale plans ($0.18 per 1,000 characters overage; Scale plans include 2M-4M characters in the $330/month base). At 10 million characters per month in overage, that is $1,800 in usage fees alone. Running Kokoro on a dedicated A100 PCIe at $1.04/hr is $748.80/month and handles far more than 10M characters per month. Self-hosting breaks even around 4-5M characters per month and gets significantly cheaper as volume grows.
What is real-time factor (RTF)?
RTF is generation time divided by output audio duration; lower is better. Kokoro-82M achieves RTF of about 0.03 on an A100, meaning 10 seconds of audio generates in well under a second. Fish Speech RTF is approximately 0.15-0.25 on A100. Hume TADA has limited public benchmarks but is estimated at 0.20-0.30 RTF on A100.
Can I run multiple TTS models on the same GPU?
Yes. Kokoro-82M model weights fit under 1GB (total GPU memory during inference is typically 2-3GB). Fish Speech needs 12GB minimum (24GB recommended). Hume TADA needs approximately 2.5GB (1B model) or 9GB (3B model with bf16). An A100 80GB can host multiple models simultaneously and route requests based on language, quality tier, or latency requirements. No separate instances are needed until you hit throughput limits.
What is Fish Audio S2 and what does it need to run?
Fish Audio S2 is FishAudio's open-weights successor to S1, released March 2026. It uses a Dual-AR architecture totaling about 4.4B parameters and posts the highest EmergentTTS-Eval win rate (81.88% vs GPT-4o-mini-TTS) of any model tested, open or closed. FishAudio has not published official minimum VRAM yet; its reference benchmarks run on a single H200 at 0.195 real-time factor with about 100ms time-to-first-audio. An H100 or H200 is a sensible starting point. It deploys via SGLang-Omni and supports free-form inline emotion control and voice cloning from a ~15-second reference clip. Weights are open for research and non-commercial use; commercial deployment needs a separate license.
Which open-source TTS model is best for voice agents?
For low-latency English voice agents, Kokoro-82M is the fastest option with the smallest VRAM footprint. Pair it with Whisper Large v3 for ASR and a 7B-13B LLM. For multilingual agents or applications needing style and emotion control, Fish Speech is the better choice despite higher VRAM requirements. See the voice AI GPU infrastructure guide for full pipeline recommendations.
