Irodori-TTS-500M-v2-Character-Voice-SigLIP


Model Description

Irodori-TTS-500M-v2-Character-Voice-SigLIP is a Japanese TTS model based on Aratako/Irodori-TTS-500M-v2. It synthesizes speech in a specific character's voice, conditioned on an image of that character. Because the conditioning signal is the encoded features of the character image rather than reference audio or a voice caption, the model supports zero-shot speech synthesis with a voice that fits the character's atmosphere.

This SigLIP variant uses SigLIP-v2-B/16-512 as the image encoder. Another model in the same Character Voice family, Irodori-TTS-500M-v2-Character-Voice-Tagger, uses a wd-tagger-based image encoder instead.
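The encoder name follows the usual Vision Transformer naming convention: "B" for the Base size, "16" for the patch size in pixels, and "512" for the input resolution. As a small worked example under that reading, the snippet below computes how many patch tokens such an encoder produces per image. The helper `patch_grid` is hypothetical, and this card does not state whether the TTS model conditions on all patch tokens or on a pooled embedding.

```python
from typing import Tuple


def patch_grid(image_size: int, patch_size: int) -> Tuple[int, int]:
    """Return (patches per side, total patches) for a square image.

    Hypothetical helper for illustration; it only does the arithmetic
    implied by the "B/16-512" naming convention.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side, side * side


# Assumption: "16" is the patch size and "512" is the input resolution.
side, num_patches = patch_grid(image_size=512, patch_size=16)
print(f"{side} x {side} grid, {num_patches} patch tokens per image")
```

Under this reading, each 512 x 512 character image is split into a 32 x 32 grid, so the encoder sees 1024 patches.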

Samples

遠くで鳴る夕暮れの鐘が、一日の終わりを告げている。家々の窓には、ぽつりぽつりと暖かな灯りがともり始めた。
(English: "A distant evening bell tolls, announcing the end of the day. In the windows of the houses, warm lights have begun to come on, one by one.")

Character Image Generated Audio

Usage

For inference code, installation instructions, the Gradio demo, and CLI examples, please refer to the GitHub repository.

For CLI inference, use --hf-checkpoint p1atdev/Irodori-TTS-500M-v2-Character-Voice-SigLIP together with --character-image. See the GitHub README for complete command examples.
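As a rough sketch of how the two flags named above fit together, the following Python snippet assembles a command line. Only `--hf-checkpoint`, `--character-image`, and the checkpoint ID come from this card. The entry point `infer.py` and the image path `character.png` are hypothetical placeholders, and any further options (such as the text to synthesize) must be taken from the GitHub README.

```python
import shlex
from pathlib import Path
from typing import List, Sequence

# Checkpoint ID as given in this model card.
HF_CHECKPOINT = "p1atdev/Irodori-TTS-500M-v2-Character-Voice-SigLIP"


def build_command(
    entry_point: Sequence[str],
    character_image: str,
    extra_args: Sequence[str] = (),
) -> List[str]:
    """Assemble an argument list for the inference CLI.

    `entry_point` is whatever the GitHub README says to run; it is not
    specified in this card, so the caller must supply it.
    """
    return [
        *entry_point,
        "--hf-checkpoint", HF_CHECKPOINT,
        "--character-image", str(Path(character_image)),
        *extra_args,  # e.g. text and output options, per the README
    ]


# "infer.py" and "character.png" are placeholders, not real project paths.
command = build_command(["python", "infer.py"], "character.png")
print(shlex.join(command))
```

Building the command as a list rather than a single string keeps paths with spaces intact if the list is later passed to `subprocess.run`.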

License

This model is released under the MIT License.

Acknowledgments

This model builds on the following projects and resources: Aratako/Irodori-TTS-500M-v2 (the base TTS model) and SigLIP-v2-B/16-512 (the image encoder).

We also thank the authors and contributors of the original Irodori-TTS project and related open-source projects.

Citation

If you use this model in research or a project, please cite:

@misc{character-voice-control,
  author = {Tingrui Zhou and Keiji Yanai},
  title = {A Character's Look Speaks Volumes: Character Image-Conditioned Speaker Style Control for Japanese Text-to-Speech},
  year = {2026},
  eprint = {TODO},
  archivePrefix = {arXiv},
  primaryClass = {cs.SD},
  url = {TODO}
}

Please also cite the original Irodori-TTS model:

@misc{irodori-tts-v2,
  author = {Chihiro Arata},
  title = {Irodori-TTS: A Flow Matching-based Text-to-Speech Model with Emoji-driven Style Control},
  year = {2026},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/Aratako/Irodori-TTS-500M-v2}}
}
Model size: 0.5B params (Safetensors, tensor type F32)