AI/ML Technical Content Strategist
Large language modeling has been, for very good reason, one of the most prominent and effective results of the AI revolution. These models have enabled numerous applications across different fields, including knowledgeable chatbots, functional agents, and general text generation. Correspondingly, there has been a race to combine other modalities with the power of these models. From vision understanding to function calling to speech generation, developers are working to make these models even more connected and useful.
One of the most exciting potential use cases for Large Language Models is generating large swathes of text for audio content, like podcasts, scripts, or even entire stories. That raises an interesting question: can AI generate human-sounding audio?
In this article, we are going to review four of the best open-source Text-to-Speech (TTS) models. Specifically, we will compare how effectively F5-TTS, Kokoro, SparkTTS, and the newly released Sesame generate a paragraph of speech audio. We will make a qualitative assessment of both how closely the speech matches the input text and how well each model handles punctuation and pauses. We hope that, together, these tests give a concrete answer as to which model might be the best for a given use case. We will also note where some models are faster than others, though almost all of them are very fast.