# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview

IndexTTS-Rust is a high-performance Text-to-Speech engine, a complete Rust rewrite of the Python IndexTTS system. It uses ONNX Runtime for neural network inference and provides zero-shot voice cloning with emotion control.
## Build and Development Commands

```bash
# Build (always build release for performance testing)
cargo build --release

# Run linter (MANDATORY before commits - catches many issues)
cargo clippy -- -D warnings

# Run tests
cargo test

# Run specific test
cargo test test_name

# Run benchmarks (Criterion-based)
cargo bench

# Run specific benchmark
cargo bench --bench mel_spectrogram
cargo bench --bench inference

# Check compilation without building
cargo check

# Format code
cargo fmt

# Full pre-commit workflow (BUILD -> CLIPPY -> BUILD)
cargo build --release && cargo clippy -- -D warnings && cargo build --release
```
## CLI Usage

```bash
# Show help
./target/release/indextts --help

# Synthesize speech
./target/release/indextts synthesize \
    --text "Hello world" \
    --voice examples/voice_01.wav \
    --output output.wav

# Generate default config
./target/release/indextts init-config -o config.yaml

# Show system info
./target/release/indextts info

# Run built-in benchmarks
./target/release/indextts benchmark --iterations 100
```
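As a rough orientation for the commands above, the subcommands and flags map naturally onto clap's derive API (which the project uses, per the dependency list). The sketch below is illustrative only; the actual argument names, types, and subcommand set in `main.rs` may differ.

```rust
// Hypothetical sketch of the CLI surface using clap's derive API.
use clap::{Parser, Subcommand};

#[derive(Parser)]
#[command(name = "indextts")]
struct Cli {
    #[command(subcommand)]
    command: Command,
}

#[derive(Subcommand)]
enum Command {
    /// Synthesize speech from text with a reference voice
    Synthesize {
        #[arg(long)]
        text: String,
        #[arg(long)]
        voice: String,
        #[arg(long)]
        output: String,
    },
    /// Print system information
    Info,
}

fn main() {
    let cli = Cli::parse();
    match cli.command {
        Command::Synthesize { text, voice, output } => {
            println!("synthesize {text:?} with voice {voice} -> {output}");
        }
        Command::Info => println!("system info"),
    }
}
```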
## Architecture

The codebase follows a modular pipeline architecture where each stage processes data sequentially:

```
Text Input → Normalization → Tokenization → Model Inference → Vocoding → Audio Output
```
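The stub below sketches the shape of that data flow in plain Rust; the function bodies and signatures are placeholders, not the actual module APIs.

```rust
// Placeholder stages standing in for the real modules (text/, model/, vocoder/).
// Each stage consumes the previous stage's output, mirroring the pipeline above.
fn normalize(text: &str) -> String { text.trim().to_string() }                       // text/normalizer.rs
fn tokenize(text: &str) -> Vec<u32> { text.bytes().map(u32::from).collect() }        // text/tokenizer.rs
fn infer(tokens: &[u32]) -> Vec<f32> { tokens.iter().map(|&t| t as f32).collect() }  // model/
fn vocode(mel: &[f32]) -> Vec<f32> { mel.to_vec() }                                  // vocoder/

fn main() {
    let audio = vocode(&infer(&tokenize(&normalize("Hello world"))));
    println!("{} samples", audio.len());
}
```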
## Core Modules (src/)

- `audio/` - Audio DSP operations
  - `mel.rs` - Mel-spectrogram computation (STFT, filterbanks)
  - `io.rs` - WAV file I/O using hound
  - `dsp.rs` - Signal processing utilities
  - `resample.rs` - Audio resampling using rubato
- `text/` - Text processing pipeline
  - `normalizer.rs` - Text normalization (Chinese/English/mixed)
  - `tokenizer.rs` - BPE tokenization via HuggingFace tokenizers
  - `phoneme.rs` - Grapheme-to-phoneme conversion
- `model/` - Neural network inference
  - `session.rs` - ONNX Runtime wrapper (load-dynamic feature)
  - `gpt.rs` - GPT-based sequence generation
  - `embedding.rs` - Speaker and emotion encoders
- `vocoder/` - Neural vocoding
  - `bigvgan.rs` - BigVGAN waveform synthesis
  - `activations.rs` - Snake/SnakeBeta activation functions
- `pipeline/` - TTS orchestration
  - `synthesis.rs` - Main synthesis logic, coordinates all modules
- `config/` - Configuration management (YAML-based via serde)
- `error.rs` - Error types using thiserror (see the sketch after this list)
- `lib.rs` - Library entry point, exposes public API
- `main.rs` - CLI entry point using clap
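Since `error.rs` is described as thiserror-based, a minimal sketch of that pattern looks like the following; the variant names and error cases here are assumptions, not the real contents of `src/error.rs`.

```rust
// Minimal sketch of the thiserror pattern; actual variants are assumptions.
use thiserror::Error;

#[derive(Error, Debug)]
pub enum TtsError {
    #[error("audio I/O failed: {0}")]
    Audio(#[from] std::io::Error),

    #[error("invalid configuration: {0}")]
    Config(String),
}

fn main() {
    let err = TtsError::Config("missing model path".into());
    println!("{err}"); // Display impl is derived from the #[error(...)] strings
}
```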
## Key Constants (lib.rs)

```rust
pub const SAMPLE_RATE: u32 = 22050;  // Output audio sample rate
pub const N_MELS: usize = 80;        // Mel filterbank channels
pub const N_FFT: usize = 1024;       // FFT size
pub const HOP_LENGTH: usize = 256;   // STFT hop length
```
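To make the constants concrete, the arithmetic below shows roughly how audio duration maps to mel frames at this hop length (assuming a center-padded STFT; the exact frame count in `mel.rs` may differ by one).

```rust
const SAMPLE_RATE: u32 = 22050;
const HOP_LENGTH: usize = 256;

fn main() {
    let seconds = 2.0_f32;
    let samples = (seconds * SAMPLE_RATE as f32) as usize; // 44100 samples
    let frames = samples / HOP_LENGTH + 1;                 // ~173 frames (~86 per second)
    println!("{seconds} s -> {samples} samples -> ~{frames} mel frames");
}
```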
## Dependencies Pattern

- Audio: `hound` (WAV), `rustfft`/`realfft` (DSP), `rubato` (resampling), `dasp` (signal processing)
- ML Inference: `ort` (ONNX Runtime with load-dynamic), `ndarray`, `safetensors`
- Text: `tokenizers` (HuggingFace), `jieba-rs` (Chinese), `regex`, `unicode-segmentation`
- Parallelism: `rayon` (data parallelism), `tokio` (async)
- CLI: `clap` (derive), `env_logger`, `indicatif`
## Important Notes

- ONNX Runtime: Uses the `load-dynamic` feature - requires the ONNX Runtime library to be installed on the system (see the environment check sketched after this list)
- Model Files: ONNX models go in the `models/` directory (not in git, download separately)
- Reference Implementation: Python code in `indextts - REMOVING - REF ONLY/` is kept for reference only
- Performance: Release builds use LTO and a single codegen unit for maximum optimization
- Audio Format: All internal processing at 22050 Hz, 80-band mel spectrograms
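With the `load-dynamic` feature, the `ort` crate typically locates `libonnxruntime` via the `ORT_DYLIB_PATH` environment variable; treat that variable name as an assumption and verify it against the `ort` documentation for the pinned version. A tiny pre-flight check:

```rust
// Pre-flight check that the dynamic ONNX Runtime library is discoverable.
// Assumption: ort's load-dynamic feature honors ORT_DYLIB_PATH.
fn main() {
    match std::env::var("ORT_DYLIB_PATH") {
        Ok(path) => println!("ONNX Runtime library expected at: {path}"),
        Err(_) => eprintln!("ORT_DYLIB_PATH is not set; ort will fall back to its default lookup"),
    }
}
```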
## Testing Strategy

- Unit tests inline in modules (typical shape sketched after this list)
- Criterion benchmarks in `benches/` for performance regression testing
- Python regression tests in `tests/` for end-to-end validation
- Example audio files in `examples/` for testing voice cloning
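For reference, the inline unit tests follow the standard `#[cfg(test)]` module pattern and run under `cargo test`; the `hz_to_mel` helper below is a made-up example, not a function from the codebase.

```rust
// Hypothetical helper used only to illustrate the inline-test layout.
fn hz_to_mel(hz: f32) -> f32 {
    2595.0 * (1.0 + hz / 700.0).log10()
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn mel_scale_is_monotonic() {
        assert!(hz_to_mel(1000.0) > hz_to_mel(100.0));
    }
}
```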
## Missing Infrastructure (TODO)

- No `scripts/manage.sh` yet (should include build, test, clean, docker controls)
- No `context.md` yet for conversation continuity
- No integration tests with actual ONNX models