Irodori-TTS-500M-v2-Character-Voice-Tagger

Model Description

Irodori-TTS-500M-v2-Character-Voice-Tagger is a Japanese TTS model based on Aratako/Irodori-TTS-500M-v2. It synthesizes speech in a voice suited to a specific character, conditioned on an image of that character. Encoded features from the character image replace reference audio or voice captions as the conditioning signal, which enables zero-shot speech synthesis with a voice that matches the character's overall impression.

This Tagger variant uses SmilingWolf/wd-vit-tagger-v3 as the image encoder. Another model in the same Character Voice family, Irodori-TTS-500M-v2-Character-Voice-SigLIP, uses a SigLIP-based image encoder instead.

Samples

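Input text: 遠くで鳴る夕暮れの鐘が、一日の終わりを告げている。家々の窓には、ぽつりぽつりと暖かな灯りがともり始めた。
(English: "The evening bell ringing in the distance announces the end of the day. In the windows of the houses, warm lights have begun to come on, one by one.")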

[Sample table: Character Image | Generated Audio. The images and audio clips are not reproduced in this text.]

Usage

For inference code, installation instructions, the Gradio demo, and CLI examples, please refer to the GitHub repository.

GitHub: p1atdev/Irodori-Character-Voice
Demo Space: Irodori-TTS-500M-v2-Character-Voice-Demo

For CLI inference, use --hf-checkpoint p1atdev/Irodori-TTS-500M-v2-Character-Voice-Tagger together with --character-image. See the GitHub README for complete command examples.
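
As a rough illustration of how the two flags above fit together, the following Python sketch assembles an argument list for a CLI call. Only the checkpoint ID and the flag names --hf-checkpoint and --character-image come from this model card. The entry point ("python infer.py"), the image path, and the helper name build_inference_command are hypothetical placeholders; the actual command and any further options (such as the text to synthesize) are documented in the GitHub README.

```python
import shlex

# Checkpoint ID and flag names are taken from the model card.
HF_CHECKPOINT = "p1atdev/Irodori-TTS-500M-v2-Character-Voice-Tagger"


def build_inference_command(entry_point, character_image, extra_args=()):
    """Assemble an argv list for CLI inference.

    entry_point is a HYPOTHETICAL placeholder for the repository's real
    inference command (see the GitHub README). extra_args carries any
    additional options that the README documents.
    """
    if not entry_point:
        raise ValueError("entry_point must name the inference command")
    return [
        *entry_point,
        "--hf-checkpoint", HF_CHECKPOINT,
        "--character-image", str(character_image),
        *extra_args,
    ]


if __name__ == "__main__":
    # "infer.py" and "character.png" are placeholders, not real file names
    # from the repository.
    cmd = build_inference_command(["python", "infer.py"], "character.png")
    print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run once the placeholder entry point is replaced with the command given in the README.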

License

This model is released under the MIT License.

Acknowledgments

This model builds on the following projects and resources:

Aratako/Irodori-TTS-500M-v2: base TTS model and architecture/codebase foundation
Aratako/Irodori-TTS: original implementation
Echo-TTS: architecture and training design reference for Irodori-TTS
Aratako/Semantic-DACVAE-Japanese-32dim: audio codec used by Irodori-TTS-500M-v2
SmilingWolf/wd-vit-tagger-v3: image encoder used by this Tagger variant

We also thank the authors and contributors of the original Irodori-TTS project and related open-source projects.

Citation

If you use this model in research or a project, please cite:

@misc{character-voice-control,
 author = {Tingrui Zhou and Keiji Yanai},
 title = {A Character's Look Speaks Volumes: Character Image-Conditioned Speaker Style Control for Japanese Text-to-Speech},
 year = {2026},
 eprint = {TODO},
 archivePrefix = {arXiv},
 primaryClass = {cs.SD},
 url = {TODO}
}

Please also cite the original Irodori-TTS model:

@misc{irodori-tts-v2,
 author = {Chihiro Arata},
 title = {Irodori-TTS: A Flow Matching-based Text-to-Speech Model with Emoji-driven Style Control},
 year = {2026},
 publisher = {Hugging Face},
 journal = {Hugging Face repository},
 howpublished = {\url{https://huggingface.co/Aratako/Irodori-TTS-500M-v2}}
}

Model size: 0.5B params (Safetensors, tensor type F32)

Base model: Aratako/Irodori-TTS-500M-v2 (this model is a fine-tune of it)

URL: https://huggingface.co/p1atdev/Irodori-TTS-500M-v2-Character-Voice-Tagger