GLM-TTS: Controllable & Emotion-Expressive Zero-shot TTS

πŸ’» GitHub Repository   |   πŸ› οΈ Audio.Z.AI

πŸ“– Model Introduction

GLM-TTS is a high-quality text-to-speech (TTS) synthesis system based on large language models, supporting zero-shot voice cloning and streaming inference. The system adopts a two-stage architecture combining an LLM for speech token generation and a Flow Matching model for waveform synthesis.

By introducing a Multi-Reward Reinforcement Learning framework, GLM-TTS significantly improves the expressiveness of generated speech, achieving more natural emotional control compared to traditional TTS systems.

Key Features

  • Zero-shot Voice Cloning: Clone any speaker's voice with just 3-10 seconds of prompt audio.
  • RL-enhanced Emotion Control: Utilizes a multi-reward reinforcement learning framework (GRPO) to optimize prosody and emotion.
  • High-quality Synthesis: Generates speech comparable to commercial systems with reduced Character Error Rate (CER).
  • Phoneme-level Control: Supports "Hybrid Phoneme + Text" input for precise pronunciation control, e.g., disambiguating polyphonic characters (see the illustrative snippet after this list).
  • Streaming Inference: Supports real-time audio generation suitable for interactive applications.
  • Bilingual Support: Optimized for Chinese and English mixed text.
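
To illustrate the phoneme-level control mentioned above: conceptually, a polyphonic character can be pinned to a specific reading while the rest of the sentence stays plain text. The bracketed-pinyin notation below is an assumption made for illustration, not the repository's documented syntax; consult the project's frontend documentation for the exact format.

# Illustrative only: the bracketed-pinyin markers are assumed notation,
# not the repository's documented syntax.
text_plain  = "他弹了一首曲子"              # "弹" and "曲" are polyphonic
text_hybrid = "他弹[tan2]了一首曲[qu3]子"   # reading pinned per character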

System Architecture

GLM-TTS follows a two-stage design:

  1. Stage 1 (LLM): A Llama-based language model converts input text into a sequence of discrete speech tokens.
  2. Stage 2 (Flow Matching): A Flow Matching model converts the token sequence into high-quality mel-spectrograms, which a vocoder then turns into waveforms.

[Figure: GLM-TTS architecture]
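
Conceptually, inference chains the two stages: the LLM emits discrete speech tokens, the Flow Matching model decodes them into a mel-spectrogram, and a vocoder renders the waveform. The sketch below is a minimal illustration of this data flow under assumed names (synthesize, llm, flow, vocoder); it is not the repository's actual API.

# Minimal sketch of the two-stage data flow (illustrative; the object
# names and method signatures are assumptions, not the repo's API).
def synthesize(text, prompt_audio, llm, flow, vocoder):
    # Stage 1: the Llama-based LLM autoregressively generates discrete
    # speech tokens conditioned on the text and the voice prompt.
    speech_tokens = llm.generate(text=text, prompt=prompt_audio)
    # Stage 2: Flow Matching decodes tokens into a mel-spectrogram,
    # which the vocoder renders as a waveform.
    mel = flow.decode(speech_tokens)
    return vocoder(mel)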

Reinforcement Learning Alignment

To tackle flat emotional expression, GLM-TTS uses a Group Relative Policy Optimization (GRPO) algorithm with multiple reward functions (Similarity, CER, Emotion, Laughter) to align the LLM's generation strategy.
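
A minimal sketch of the group-relative part of GRPO: for each prompt, several candidate utterances are sampled, each is scored by a weighted mix of the reward signals, and each candidate's advantage is its score relative to the group mean. The weights and reward callables below are placeholders, not the paper's exact configuration.

import numpy as np

# Sketch of multi-reward GRPO advantages (weights and reward functions
# are placeholders, not the actual training configuration).
def group_advantages(candidates, rewards, weights):
    # candidates: G sampled utterances for the same prompt
    # rewards: dict name -> callable scoring one utterance
    # weights: dict name -> mixing coefficient
    scores = np.array([
        sum(weights[name] * fn(cand) for name, fn in rewards.items())
        for cand in candidates
    ])
    # Group-relative normalization: advantage = deviation from the
    # group mean, scaled by the group's standard deviation.
    return (scores - scores.mean()) / (scores.std() + 1e-8)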

Evaluation Results

Evaluated on the seed-tts-eval benchmark, GLM-TTS_RL achieves the lowest Character Error Rate (CER) of the compared systems while maintaining high speaker similarity (SIM).

| Model | CER ↓ | SIM ↑ | Open-source |
|-------|-------|-------|-------------|
| Seed-TTS | 1.12 | 79.6 | πŸ”’ No |
| CosyVoice2 | 1.38 | 75.7 | πŸ‘ Yes |
| F5-TTS | 1.53 | 76.0 | πŸ‘ Yes |
| GLM-TTS (Base) | 1.03 | 76.1 | πŸ‘ Yes |
| GLM-TTS_RL (Ours) | 0.89 | 76.4 | πŸ‘ Yes |
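
For reference, CER is the character-level edit distance between the target text and an ASR transcript of the synthesized audio, divided by the reference length (reported above as a percentage). A quick way to compute it on your own transcripts is the jiwer package; producing the transcript (e.g., with an ASR model) is not shown here.

import jiwer

# CER = character-level edit distance / reference length.
reference  = "今天天气怎么样"
hypothesis = "今天天汽怎么样"   # one substituted character out of seven
print(jiwer.cer(reference, hypothesis))  # ≈ 0.143, i.e. 14.3%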

Quick Start

Installation

git clone https://github.com/zai-org/GLM-TTS.git
cd GLM-TTS
pip install -r requirements.txt

Command Line Inference

python glmtts_inference.py \
    --data=example_zh \
    --exp_name=_test \
    --use_cache
# Add --phoneme to enable hybrid phoneme + text input (e.g., polyphone control).
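
For the streaming mode mentioned under Key Features, a typical consumer pattern plays each audio chunk as soon as it is produced instead of waiting for the full utterance. The generator and playback functions below (stream_synthesize, play_chunk) are hypothetical names, not the repository's API.

# Hypothetical streaming consumer loop (stream_synthesize and play_chunk
# are illustrative names, not the repo's API).
def play_streaming(stream_synthesize, play_chunk, text, prompt_audio):
    for chunk in stream_synthesize(text=text, prompt=prompt_audio):
        play_chunk(chunk)  # e.g., write PCM frames to the audio device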

Shell Script Inference

bash glmtts_inference.sh

Acknowledgments & Citation

We thank the following open-source projects for their support:

  • CosyVoice - frontend processing framework and high-quality vocoder
  • Llama - base language model architecture
  • Vocos - high-quality vocoder
  • GRPO-Zero - inspiration for the reinforcement learning algorithm implementation

If you use GLM-TTS in your research, please cite:

@misc{glmtts2025,
  title={GLM-TTS: Controllable & Emotion-Expressive Zero-shot TTS with Multi-Reward Reinforcement Learning},
  author={CogAudio Group Members},
  year={2025},
  publisher={Zhipu AI Inc}
}