URL: https://huggingface.co/p1atdev/Irodori-TTS-500M-v2-Character-Voice-Tagger


Irodori-TTS-500M-v2-Character-Voice-Tagger

Model Description

Irodori-TTS-500M-v2-Character-Voice-Tagger is a Japanese text-to-speech (TTS) model based on Aratako/Irodori-TTS-500M-v2. It synthesizes speech in a voice suited to a specific character by conditioning on an image of that character. Instead of reference audio or a voice caption, the conditioning signal is a set of features encoded from the character image, which enables zero-shot synthesis of a voice that matches the overall impression the character gives.

This Tagger variant uses SmilingWolf/wd-vit-tagger-v3 as the image encoder. Another model in the same Character Voice family, Irodori-TTS-500M-v2-Character-Voice-SigLIP, uses a SigLIP-based image encoder instead.

Samples

Sample text: 遠くで鳴る夕暮れの鐘が、一日の終わりを告げている。家々の窓には、ぽつりぽつりと暖かな灯りがともり始めた。
(English: "The evening bell ringing in the distance signals the end of the day. In the windows of the houses, warm lights have begun to come on, one by one.")

Usage

For inference code, installation instructions, the Gradio demo, and CLI examples, please refer to the GitHub repository.

For CLI inference, use --hf-checkpoint p1atdev/Irodori-TTS-500M-v2-Character-Voice-Tagger together with --character-image. See the GitHub README for complete command examples.
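
As a rough illustration of how the two flags above fit together, the following Python sketch assembles an argument list that could be passed to `subprocess.run`. It is not taken from the repository: the helper name `build_inference_args` and the `python infer.py` entry point are hypothetical placeholders, and only `--hf-checkpoint` and `--character-image` come from this model card. Take the real command and any further options (input text, output path, and so on) from the GitHub README.

```python
import shlex
from pathlib import Path

# Checkpoint ID as given on this model card.
HF_CHECKPOINT = "p1atdev/Irodori-TTS-500M-v2-Character-Voice-Tagger"


def build_inference_args(entry_point, character_image, extra_args=()):
    """Assemble a CLI argument list (hypothetical helper, not part of the repo).

    entry_point:     tokens that launch the CLI; a placeholder here, the real
                     command is documented in the GitHub README.
    character_image: path to the character image used as the condition.
    extra_args:      any further options the README documents.
    """
    return [
        *entry_point,
        "--hf-checkpoint", HF_CHECKPOINT,
        "--character-image", str(Path(character_image)),
        *extra_args,
    ]


if __name__ == "__main__":
    # "python infer.py" is a made-up entry point, used only so the sketch
    # has something to print; replace it with the command from the README.
    args = build_inference_args(["python", "infer.py"], "character.png")
    print(shlex.join(args))
```

Keeping the arguments as a list rather than a single string avoids shell-quoting problems when the image path contains spaces.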

License

This model is released under the MIT License.

Acknowledgments

This model builds on several projects and resources, including the base model Aratako/Irodori-TTS-500M-v2 and the image encoder SmilingWolf/wd-vit-tagger-v3.

We also thank the authors and contributors of the original Irodori-TTS project and related open-source projects.

Citation

If you use this model in research or a project, please cite:

@misc{character-voice-control,
 author = {Tingrui Zhou and Keiji Yanai},
 title = {A Character's Look Speaks Volumes: Character Image-Conditioned Speaker Style Control for Japanese Text-to-Speech},
 year = {2026},
 eprint = {TODO},
 archivePrefix = {arXiv},
 primaryClass = {cs.SD},
 url = {TODO}
}

Please also cite the original Irodori-TTS model:

@misc{irodori-tts-v2,
 author = {Chihiro Arata},
 title = {Irodori-TTS: A Flow Matching-based Text-to-Speech Model with Emoji-driven Style Control},
 year = {2026},
 publisher = {Hugging Face},
 journal = {Hugging Face repository},
 howpublished = {\url{https://huggingface.co/Aratako/Irodori-TTS-500M-v2}}
}

Model size: 0.5B params (Safetensors); tensor type: F32
