Irodori-TTS-500M-v2-Character-Voice-SigLIP
Model Description
Irodori-TTS-500M-v2-Character-Voice-SigLIP is a Japanese text-to-speech (TTS) model based on Aratako/Irodori-TTS-500M-v2. It synthesizes speech in a specific character's voice by conditioning on that character's image. Because the conditioning signal is the encoded features of a character image rather than reference audio or a voice caption, the model supports zero-shot speech synthesis with a voice that matches the character's visual impression.
This SigLIP variant uses SigLIP-v2-B/16-512 as the image encoder. Another model in the same Character Voice family, Irodori-TTS-500M-v2-Character-Voice-Tagger, uses a wd-tagger-based image encoder instead.
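As a rough orientation for what the image encoder produces, the name SigLIP-v2-B/16-512 can be read as a ViT-Base backbone with 16-pixel patches at 512x512 input. The sketch below works out the resulting patch-token grid. The helper name `patch_token_shape` is illustrative and not part of the repository, and this card does not state whether the TTS model conditions on the full token sequence or on a pooled vector, so the shape should be read only as what the backbone emits before any pooling.

```python
def patch_token_shape(image_size: int = 512, patch_size: int = 16, width: int = 768) -> tuple[int, int]:
    """Return (num_patch_tokens, feature_width) for a square ViT input.

    Defaults follow the "B/16-512" naming: ViT-Base width 768,
    16x16 patches, 512x512 input. Hypothetical helper for illustration.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size  # patches per row and per column
    return side * side, width


tokens, width = patch_token_shape()
print(tokens, width)  # 1024 768
```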
Samples
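遠くで鳴る夕暮れの鐘が、一日の終わりを告げている。家々の窓には、ぽつり、ぽつりと暖かな灯りがともり始めた。 (English: "The evening bell ringing in the distance announces the end of the day. In the windows of the houses, warm lights have begun to glow, one by one.")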
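| Character Image | Generated Audio |
|---|---|
| (character image 1) | (generated audio 1) |
| (character image 2) | (generated audio 2) |
| (character image 3) | (generated audio 3) |
| (character image 4) | (generated audio 4) |
| (character image 5) | (generated audio 5) |
| (character image 6) | (generated audio 6) |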
Usage
For inference code, installation instructions, the Gradio demo, and CLI examples, please refer to the GitHub repository.
- GitHub: p1atdev/Irodori-Character-Voice
- Demo Space: Irodori-TTS-500M-v2-Character-Voice-Demo
For CLI inference, use --hf-checkpoint p1atdev/Irodori-TTS-500M-v2-Character-Voice-SigLIP together with --character-image.
See the GitHub README for complete command examples.
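The CLI invocation described above could be assembled programmatically as in the following sketch. Only `--hf-checkpoint` and `--character-image` come from this model card. The entry-point name `infer.py` is a placeholder assumption, and any further options (for example, how the input text is passed) must be taken from the GitHub README and supplied through `extra_args`.

```python
import shlex
from pathlib import Path

HF_CHECKPOINT = "p1atdev/Irodori-TTS-500M-v2-Character-Voice-SigLIP"


def build_command(character_image, extra_args=(), entry_point="infer.py"):
    """Build an argv list for CLI inference.

    `entry_point` is a placeholder: the real script name is documented
    in the GitHub README, not in this model card.
    """
    return [
        "python",
        entry_point,
        "--hf-checkpoint", HF_CHECKPOINT,
        "--character-image", str(Path(character_image)),
        *extra_args,  # remaining options, as documented in the README
    ]


cmd = build_command("character.png")
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

Passing the list to `subprocess.run(cmd, check=True)` instead of a single string avoids shell quoting problems when the image path contains spaces.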
License
This model is released under the MIT License.
Acknowledgments
This model builds on the following projects and resources:
- Aratako/Irodori-TTS-500M-v2: base TTS model and architecture/codebase foundation
- Aratako/Irodori-TTS: original implementation
- Echo-TTS: architecture and training design reference for Irodori-TTS
- Aratako/Semantic-DACVAE-Japanese-32dim: audio codec used by Irodori-TTS-500M-v2
- timm/vit_base_patch16_siglip_512.v2_webli: image encoder used by this SigLIP variant
We also thank the authors and contributors of the original Irodori-TTS project and related open-source projects.
Citation
If you use this model in research or a project, please cite:
@misc{character-voice-control,
author = {Tingrui Zhou and Keiji Yanai},
title = {A Character's Look Speaks Volumes: Character Image-Conditioned Speaker Style Control for Japanese Text-to-Speech},
year = {2026},
eprint = {TODO},
archivePrefix = {arXiv},
primaryClass = {cs.SD},
url = {TODO}
}
Please also cite the original Irodori-TTS model:
@misc{irodori-tts-v2,
author = {Chihiro Arata},
title = {Irodori-TTS: A Flow Matching-based Text-to-Speech Model with Emoji-driven Style Control},
year = {2026},
publisher = {Hugging Face},
journal = {Hugging Face repository},
howpublished = {\url{https://huggingface.co/Aratako/Irodori-TTS-500M-v2}}
}




