# GLM-TTS: Controllable & Emotion-Expressive Zero-shot TTS
GitHub Repository | Audio.Z.AI
## Model Introduction
GLM-TTS is a high-quality text-to-speech (TTS) synthesis system based on large language models, supporting zero-shot voice cloning and streaming inference. The system adopts a two-stage architecture: an LLM generates speech tokens, and a Flow Matching model turns them into mel-spectrograms, which a vocoder renders as waveforms.
By introducing a Multi-Reward Reinforcement Learning framework, GLM-TTS significantly improves the expressiveness of generated speech, achieving more natural emotional control compared to traditional TTS systems.
## Key Features
- Zero-shot Voice Cloning: Clone any speaker's voice with just 3-10 seconds of prompt audio.
- RL-enhanced Emotion Control: Utilizes a multi-reward reinforcement learning framework (GRPO) to optimize prosody and emotion.
- High-quality Synthesis: Generates speech comparable to commercial systems with reduced Character Error Rate (CER).
- Phoneme-level Control: Supports "Hybrid Phoneme + Text" input for precise pronunciation control, e.g., disambiguating polyphonic characters (see the illustrative sketch after this list).
- Streaming Inference: Supports real-time audio generation suitable for interactive applications.
- Bilingual Support: Optimized for Chinese and English mixed text.
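The exact hybrid-input markup is defined by the repository's text frontend; the sketch below is a self-contained illustration of the idea only, and the brace-delimited pinyin convention in it is an assumption for this example, not GLM-TTS's actual format.

```python
# Illustration of "Hybrid Phoneme + Text" input. NOTE: the {pinyin}
# markup is an assumed convention for this sketch; check the repo's
# frontend for the real one. Plain text passes through unchanged, while
# explicit phoneme tags pin down readings of polyphonic characters.

def apply_phoneme_overrides(text: str, overrides: dict[int, str]) -> str:
    """Append a phoneme tag after each character index listed in `overrides`."""
    out = []
    for i, ch in enumerate(text):
        out.append(ch)
        if i in overrides:
            out.append("{" + overrides[i] + "}")  # e.g. 行 -> 行{hang2}
    return "".join(out)

# 行 is polyphonic (xing2 / hang2); force the "bank" reading here.
print(apply_phoneme_overrides("这家银行就在前面。", {3: "hang2"}))
# -> 这家银行{hang2}就在前面。
```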
## System Architecture
GLM-TTS follows a two-stage design (a code sketch follows the list):
- Stage 1 (LLM): A Llama-based model converts input text into speech token sequences.
- Stage 2 (Flow Matching): A Flow model converts token sequences into high-quality mel-spectrograms, which are then turned into waveforms by a vocoder.
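The end-to-end data flow can be summarized in a few lines. This is a conceptual sketch with dummy stand-ins; every class here is a hypothetical placeholder rather than the repo's API (the real entry point is `glmtts_inference.py`):

```python
import numpy as np

# Hypothetical placeholders for the two stages plus the vocoder.
class SpeechLLM:                 # Stage 1: text -> discrete speech tokens
    def generate(self, text, prompt_tokens):
        return np.random.randint(0, 4096, size=50 * len(text))  # dummy tokens

class FlowMatcher:               # Stage 2: tokens -> mel-spectrogram
    def decode(self, tokens, spk_emb):
        return np.random.randn(80, len(tokens))                 # dummy mel

class Vocoder:                   # mel -> waveform (e.g., Vocos)
    def __call__(self, mel):
        return np.random.randn(mel.shape[1] * 256)              # dummy audio

def tts(text, prompt_tokens, spk_emb):
    tokens = SpeechLLM().generate(text, prompt_tokens)  # autoregressive LLM
    mel = FlowMatcher().decode(tokens, spk_emb)         # flow matching
    return Vocoder()(mel)                               # final waveform

# The prompt tokens / speaker embedding would come from 3-10 s of prompt audio.
wav = tts("你好，世界", prompt_tokens=np.zeros(10, dtype=int), spk_emb=np.zeros(192))
```

Since Stage 1 emits tokens autoregressively, the downstream stages can in principle consume them chunk by chunk, which is what the streaming inference mode exploits.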
### Reinforcement Learning Alignment
To tackle flat emotional expression, GLM-TTS applies Group Relative Policy Optimization (GRPO) with multiple reward functions (speaker similarity, CER, emotion, laughter) to align the LLM's generation policy.
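As a minimal sketch of the core GRPO step, the snippet below normalizes rewards within a group of rollouts for the same prompt. The reward weights and the weighted-sum aggregation are illustrative assumptions; the actual reward models and their combination live in the repository.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    """GRPO replaces a learned critic with within-group normalization.
    rewards: (G,) scalar rewards for G rollouts of the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Four candidate utterances sampled for one (text, prompt-audio) pair, each
# scored by the four rewards (higher is better, so CER enters negated).
scores = np.array([
    #  SIM    -CER   emotion  laughter
    [0.78, -0.010, 0.60, 0.0],
    [0.74, -0.030, 0.85, 0.1],
    [0.80, -0.005, 0.40, 0.0],
    [0.69, -0.060, 0.90, 0.3],
])
weights = np.array([1.0, 5.0, 1.0, 0.5])    # illustrative weighting only
combined = scores @ weights                 # one scalar reward per rollout
print(group_relative_advantages(combined))  # scales each rollout's policy gradient
```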
## Evaluation Results
Evaluated on the seed-tts-eval benchmark, GLM-TTS_RL achieves the lowest Character Error Rate (CER) among the compared systems while maintaining high speaker similarity (SIM).
| Model | CER ↓ | SIM ↑ | Open-source |
|---|---|---|---|
| Seed-TTS | 1.12 | 79.6 | No |
| CosyVoice2 | 1.38 | 75.7 | Yes |
| F5-TTS | 1.53 | 76.0 | Yes |
| GLM-TTS (Base) | 1.03 | 76.1 | Yes |
| GLM-TTS_RL (Ours) | 0.89 | 76.4 | Yes |
## Quick Start
### Installation
```bash
git clone https://github.com/zai-org/GLM-TTS.git
cd GLM-TTS
pip install -r requirements.txt
```
### Command Line Inference
```bash
python glmtts_inference.py \
    --data=example_zh \
    --exp_name=_test \
    --use_cache
# Add --phoneme to enable hybrid phoneme + text input.
```
### Shell Script Inference
```bash
bash glmtts_inference.sh
```
## Acknowledgments & Citation
We thank the following open-source projects for their support:
- CosyVoice - frontend processing framework and high-quality vocoder
- Llama - base language model architecture
- Vocos - high-quality vocoder
- GRPO-Zero - inspiration for the reinforcement learning algorithm implementation
If you use GLM-TTS in your research, please cite:
```bibtex
@misc{glmtts2025,
  title={GLM-TTS: Controllable & Emotion-Expressive Zero-shot TTS with Multi-Reward Reinforcement Learning},
  author={CogAudio Group Members},
  year={2025},
  publisher={Zhipu AI Inc}
}
```