CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

IndexTTS-Rust is a high-performance Text-to-Speech engine, a complete Rust rewrite of the Python IndexTTS system. It uses ONNX Runtime for neural network inference and provides zero-shot voice cloning with emotion control.

Build and Development Commands

# Build (always build release for performance testing)
cargo build --release

# Run linter (MANDATORY before commits - catches many issues)
cargo clippy -- -D warnings

# Run tests
cargo test

# Run specific test
cargo test test_name

# Run benchmarks (Criterion-based)
cargo bench

# Run specific benchmark
cargo bench --bench mel_spectrogram
cargo bench --bench inference

# Check that the code compiles without producing a binary
cargo check

# Format code
cargo fmt

# Full pre-commit workflow (BUILD -> CLIPPY -> BUILD)
cargo build --release && cargo clippy -- -D warnings && cargo build --release

CLI Usage

# Show help
./target/release/indextts --help

# Synthesize speech
./target/release/indextts synthesize \
  --text "Hello world" \
  --voice examples/voice_01.wav \
  --output output.wav

# Generate default config
./target/release/indextts init-config -o config.yaml

# Show system info
./target/release/indextts info

# Run built-in benchmarks
./target/release/indextts benchmark --iterations 100

Architecture

The codebase follows a modular pipeline architecture where each stage processes data sequentially:

Text Input → Normalization → Tokenization → Model Inference → Vocoding → Audio Output
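
The stage ordering above can be sketched as a chain of functions, each consuming the previous stage's output. This is only an illustrative sketch: every function name and every placeholder body below is hypothetical and does not come from the IndexTTS-Rust codebase, where the real stages live in text/, model/, vocoder/ and pipeline/.

```rust
// Hypothetical sketch of the stage ordering only.
// None of these names or bodies are taken from the actual crate.

/// Stage 1: normalization (placeholder: trim and lowercase).
fn normalize(text: &str) -> String {
    text.trim().to_lowercase()
}

/// Stage 2: tokenization (placeholder: one token id per byte, standing in for BPE).
fn tokenize(text: &str) -> Vec<u32> {
    text.bytes().map(u32::from).collect()
}

/// Stage 3: model inference (placeholder: one 80-bin mel frame per token).
fn infer_mel(tokens: &[u32]) -> Vec<[f32; 80]> {
    tokens.iter().map(|&t| [t as f32 / 255.0; 80]).collect()
}

/// Stage 4: vocoding (placeholder: 256 output samples per mel frame).
fn vocode(mel: &[[f32; 80]]) -> Vec<f32> {
    mel.iter()
        .flat_map(|frame| std::iter::repeat(frame[0]).take(256))
        .collect()
}

/// Runs the stages in order: text -> tokens -> mel frames -> samples.
fn synthesize(text: &str) -> Vec<f32> {
    let normalized = normalize(text);
    let tokens = tokenize(&normalized);
    let mel = infer_mel(&tokens);
    vocode(&mel)
}
```

Keeping each stage a plain function over owned or borrowed data makes it possible to test and benchmark stages in isolation, which is one plausible reason for a pipeline layout like this.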

Core Modules (src/)

  • audio/ - Audio DSP operations

    • mel.rs - Mel-spectrogram computation (STFT, filterbanks)
    • io.rs - WAV file I/O using hound
    • dsp.rs - Signal processing utilities
    • resample.rs - Audio resampling using rubato
  • text/ - Text processing pipeline

    • normalizer.rs - Text normalization (Chinese/English/mixed)
    • tokenizer.rs - BPE tokenization via HuggingFace tokenizers
    • phoneme.rs - Grapheme-to-phoneme conversion
  • model/ - Neural network inference

    • session.rs - ONNX Runtime wrapper (load-dynamic feature)
    • gpt.rs - GPT-based sequence generation
    • embedding.rs - Speaker and emotion encoders
  • vocoder/ - Neural vocoding

    • bigvgan.rs - BigVGAN waveform synthesis
    • activations.rs - Snake/SnakeBeta activation functions
  • pipeline/ - TTS orchestration

    • synthesis.rs - Main synthesis logic, coordinates all modules
  • config/ - Configuration management (YAML-based via serde)

  • error.rs - Error types using thiserror

  • lib.rs - Library entry point, exposes public API

  • main.rs - CLI entry point using clap
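
For orientation, the Snake and SnakeBeta activations listed under vocoder/ have a simple closed form: Snake(x) = x + (1/α)·sin²(αx), and SnakeBeta uses a separate β for the magnitude term. The sketch below is a minimal scalar version under stated assumptions: the function names are hypothetical, alpha and beta are assumed to be strictly positive, and details a real implementation may include (per-channel parameters, a small epsilon in the denominator, log-scale parameters) are left out. It is not the code in activations.rs.

```rust
// Minimal scalar sketch of the Snake family of activations.
// Hypothetical names; assumes alpha > 0 and beta > 0.

/// Snake: x + (1 / alpha) * sin^2(alpha * x).
fn snake(x: f32, alpha: f32) -> f32 {
    x + (alpha * x).sin().powi(2) / alpha
}

/// SnakeBeta: alpha controls the frequency, beta the magnitude
/// of the periodic term: x + (1 / beta) * sin^2(alpha * x).
fn snake_beta(x: f32, alpha: f32, beta: f32) -> f32 {
    x + (alpha * x).sin().powi(2) / beta
}

/// Applies Snake element-wise to a buffer, in place.
fn snake_in_place(xs: &mut [f32], alpha: f32) {
    for x in xs.iter_mut() {
        *x = snake(*x, alpha);
    }
}
```

With beta equal to alpha, SnakeBeta reduces to Snake, which gives a convenient sanity check when comparing against a reference implementation.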

Key Constants (lib.rs)

pub const SAMPLE_RATE: u32 = 22050;  // Output audio sample rate
pub const N_MELS: usize = 80;        // Mel filterbank channels
pub const N_FFT: usize = 1024;       // FFT size
pub const HOP_LENGTH: usize = 256;   // STFT hop length
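
A few quantities follow directly from these constants and are handy when sizing buffers or checking tensor shapes. The sketch below is an illustration, not part of lib.rs: the helper function names are hypothetical, and the frame-count formula assumes center-padded STFT framing, which the source does not confirm.

```rust
// Constants repeated from above so the sketch is self-contained.
pub const SAMPLE_RATE: u32 = 22050;
pub const N_FFT: usize = 1024;
pub const HOP_LENGTH: usize = 256;

/// Mel frames produced per second of audio (about 86.13).
pub fn frames_per_second() -> f32 {
    SAMPLE_RATE as f32 / HOP_LENGTH as f32
}

/// Frame count for a signal of `num_samples` samples.
/// ASSUMPTION: center-padded framing, i.e. 1 + floor(n / hop).
pub fn num_frames(num_samples: usize) -> usize {
    1 + num_samples / HOP_LENGTH
}

/// Width of one FFT bin in Hz (about 21.53 Hz).
pub fn bin_width_hz() -> f32 {
    SAMPLE_RATE as f32 / N_FFT as f32
}

/// Number of frequency bins from a real-input FFT of size N_FFT.
pub fn num_freq_bins() -> usize {
    N_FFT / 2 + 1
}
```

Under these assumptions, one second of audio (22050 samples) maps to 87 frames of 513 linear-frequency bins each, before the 80-channel mel filterbank is applied.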

Dependencies Pattern

  • Audio: hound (WAV), rustfft/realfft (DSP), rubato (resampling), dasp (signal processing)
  • ML Inference: ort (ONNX Runtime with load-dynamic), ndarray, safetensors
  • Text: tokenizers (HuggingFace), jieba-rs (Chinese), regex, unicode-segmentation
  • Parallelism: rayon (data parallelism), tokio (async)
  • CLI: clap (derive), env_logger, indicatif

Important Notes

  1. ONNX Runtime: Uses load-dynamic feature - requires ONNX Runtime library installed on system
  2. Model Files: ONNX models go in models/ directory (not in git, download separately)
  3. Reference Implementation: The Python code in the "indextts - REMOVING - REF ONLY/" directory is kept for reference only and is slated for removal
  4. Performance: Release builds use LTO and single codegen-unit for maximum optimization
  5. Audio Format: All internal processing runs at 22050 Hz with 80-band mel spectrograms

Testing Strategy

  • Unit tests inline in modules
  • Criterion benchmarks in benches/ for performance regression testing
  • Python regression tests in tests/ for end-to-end validation
  • Example audio files in examples/ for testing voice cloning
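
As an illustration of the "unit tests inline in modules" convention, the sketch below places a `#[cfg(test)]` module next to the code it covers. The `peak_normalize` helper is hypothetical and chosen only to keep the example self-contained; it is not claimed to exist in src/audio/.

```rust
/// Hypothetical helper: scales samples so the largest absolute value is 1.0.
/// Silent (all-zero) input is left unchanged to avoid dividing by zero.
pub fn peak_normalize(samples: &mut [f32]) {
    let peak = samples.iter().fold(0.0_f32, |m, s| m.max(s.abs()));
    if peak > 0.0 {
        for s in samples.iter_mut() {
            *s /= peak;
        }
    }
}

// Inline test module: compiled only for `cargo test`.
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn scales_peak_to_one() {
        let mut v = vec![0.5_f32, -2.0, 1.0];
        peak_normalize(&mut v);
        assert_eq!(v, vec![0.25, -1.0, 0.5]);
    }

    #[test]
    fn leaves_silence_untouched() {
        let mut v = vec![0.0_f32; 4];
        peak_normalize(&mut v);
        assert_eq!(v, vec![0.0; 4]);
    }
}
```

A single test from such a module can be run with the `cargo test test_name` form shown earlier, for example `cargo test scales_peak_to_one`.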

Missing Infrastructure (TODO)

  • No scripts/manage.sh yet (should include build, test, clean, docker controls)
  • No context.md yet for conversation continuity
  • No integration tests with actual ONNX models