Irodori-TTS-500M-v2-Character-Voice-SigLIP

Model Description

Irodori-TTS-500M-v2-Character-Voice-SigLIP is a Japanese TTS model based on Aratako/Irodori-TTS-500M-v2. It synthesizes speech in a voice suited to a specific character by conditioning on that character's image. Because the conditioning signal consists of features encoded from the image, rather than reference audio or a voice caption, the model can perform zero-shot speech synthesis with a voice that matches the character's overall impression.

This SigLIP variant uses SigLIP-v2-B/16-512 as the image encoder. Another model in the same Character Voice family, Irodori-TTS-500M-v2-Character-Voice-Tagger, uses a wd-tagger-based image encoder instead.
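As a rough illustration of what the "B/16-512" suffix implies, the sketch below derives the patch grid a ViT-Base encoder sees for a 512-pixel input with 16-pixel patches. The helper name `patch_grid` is hypothetical, and this card does not state whether the TTS model consumes per-patch features or a pooled embedding, so treat the numbers only as the encoder's input geometry.

```python
# Decode the "B/16-512" part of the encoder name into a patch grid.
# Assumption: square input image and square, non-overlapping patches.
IMAGE_SIZE = 512  # input resolution in pixels (the "512" in the name)
PATCH_SIZE = 16   # patch edge length in pixels (the "/16" in the name)
EMBED_DIM = 768   # hidden width of a ViT-Base ("B") backbone


def patch_grid(image_size: int, patch_size: int) -> tuple[int, int]:
    """Return (patches per side, total patches) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of the patch size")
    side = image_size // patch_size
    return side, side * side


side, num_patches = patch_grid(IMAGE_SIZE, PATCH_SIZE)
print(f"{side} x {side} = {num_patches} patches, each embedded to {EMBED_DIM} dims")
# prints "32 x 32 = 1024 patches, each embedded to 768 dims"
```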

Samples

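Sample text (Japanese input): 遠くで鳴る夕暮れの鐘が、一日の終わりを告げている。家々の窓には、ぽつりぽつりと暖かな灯りがともり始めた。
English translation: "The evening bell ringing in the distance announces the end of the day. In the windows of the houses, warm lights have begun to come on, one by one."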

[Table: Character Image | Generated Audio. The image and audio samples are not reproduced in this text version.]

Usage

For inference code, installation instructions, the Gradio demo, and CLI examples, please refer to the GitHub repository.

GitHub: p1atdev/Irodori-Character-Voice
Demo Space: Irodori-TTS-500M-v2-Character-Voice-Demo

For CLI inference, use --hf-checkpoint p1atdev/Irodori-TTS-500M-v2-Character-Voice-SigLIP together with --character-image. See the GitHub README for complete command examples.
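The flag combination above can be sketched as a small helper that assembles the argument list. This is only an illustration: the function name `build_cli_args` and the file name `character.png` are hypothetical, and the entry-point script is deliberately omitted because this card does not name it (the GitHub README has the actual commands).

```python
import shlex

# Checkpoint ID taken from this model card.
HF_CHECKPOINT = "p1atdev/Irodori-TTS-500M-v2-Character-Voice-SigLIP"


def build_cli_args(character_image: str, hf_checkpoint: str = HF_CHECKPOINT) -> list[str]:
    """Return the two flags named in this card as an argv-style list.

    The inference script itself is not included; prepend the command
    given in the GitHub README.
    """
    if not character_image:
        raise ValueError("a character image path is required")
    return [
        "--hf-checkpoint", hf_checkpoint,
        "--character-image", character_image,
    ]


# Hypothetical image path, used only for illustration.
args = build_cli_args("character.png")
print(shlex.join(args))
# prints "--hf-checkpoint p1atdev/Irodori-TTS-500M-v2-Character-Voice-SigLIP --character-image character.png"
```

Building the arguments as a list and joining them with `shlex.join` keeps image paths that contain spaces correctly quoted when the command is pasted into a shell.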

License

This model is released under the MIT License.

Acknowledgments

This model builds on the following projects and resources:

Aratako/Irodori-TTS-500M-v2: base TTS model and architecture/codebase foundation
Aratako/Irodori-TTS: original implementation
Echo-TTS: architecture and training design reference for Irodori-TTS
Aratako/Semantic-DACVAE-Japanese-32dim: audio codec used by Irodori-TTS-500M-v2
timm/vit_base_patch16_siglip_512.v2_webli: image encoder used by this SigLIP variant

We also thank the authors and contributors of the original Irodori-TTS project and related open-source projects.

Citation

If you use this model in research or a project, please cite:

@misc{character-voice-control,
  author = {Tingrui Zhou and Keiji Yanai},
  title = {A Character's Look Speaks Volumes: Character Image-Conditioned Speaker Style Control for Japanese Text-to-Speech},
  year = {2026},
  eprint = {TODO},
  archivePrefix = {arXiv},
  primaryClass = {cs.SD},
  url = {TODO}
}

Please also cite the original Irodori-TTS model:

@misc{irodori-tts-v2,
  author = {Chihiro Arata},
  title = {Irodori-TTS: A Flow Matching-based Text-to-Speech Model with Emoji-driven Style Control},
  year = {2026},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/Aratako/Irodori-TTS-500M-v2}}
}