# IndexTTS-Rust Context
This file preserves important context for conversation continuity between Hue and Aye sessions.
**Last Updated:** 2025-11-16
---
## The Vision
IndexTTS-Rust is part of a larger audio intelligence ecosystem at 8b.is:
1. **kokoro-tiny** - Lightweight TTS (82M params, 50+ voices, on crates.io!)
2. **IndexTTS-Rust** - Advanced zero-shot TTS with emotion control
3. **Phoenix-Protocol** - Audio restoration/enhancement layer
4. **MEM|8** - Contextual memory system (mem-8.com, mem8)
Together these form a complete audio intelligence pipeline.
---
## Phoenix Protocol Integration Opportunities
The Phoenix Protocol (phoenix-protocol/) is a PERFECT complement to IndexTTS-Rust:
### Direct Module Mappings
| Phoenix Module | IndexTTS Use Case |
|----------------|-------------------|
| `emotional.rs` | Map to our 8D emotion control (Warmth→body, Presence→power, Clarity→articulation, Air→space, Ultrasonics→depth) |
| `voice_signature.rs` | Enhance speaker embeddings for voice cloning |
| `spectral_velocity.rs` | Add momentum tracking to mel-spectrogram |
| `marine.rs` | Validate TTS output authenticity/quality |
| `golden_ratio.rs` | Post-process vocoder output with harmonic enhancement |
| `harmonic_resurrection.rs` | Add richness to synthesized speech |
| `micro_dynamics.rs` | Restore natural speech dynamics |
| `autotune.rs` | Improve prosody and pitch control |
| `mem8_integration.rs` | Already has MEM\|8 hooks! |
### Shared Dependencies
Both projects use:
- rayon (parallelism)
- rustfft/realfft (FFT)
- ndarray (array operations)
- hound (WAV I/O)
- serde (config serialization)
- anyhow (error handling)
- ort (ONNX Runtime)
### Audio Constants
| Project | Sample Rate | Use Case |
|---------|------------|----------|
| IndexTTS-Rust | 22,050 Hz | Standard TTS output |
| Phoenix-Protocol | 192,000 Hz | Ultrasonic restoration |
| kokoro-tiny | 24,000 Hz | Lightweight TTS |
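Because the three projects run at different sample rates, audio handed from one to another has to be resampled. A minimal sketch of working out the reduced up/down factors a rational resampler would need; `gcd` and `resample_ratio` are illustrative helpers, not functions from any of these crates:

```rust
/// Greatest common divisor via Euclid's algorithm.
fn gcd(a: u32, b: u32) -> u32 {
    if b == 0 { a } else { gcd(b, a % b) }
}

/// Reduced (upsample, downsample) factors for converting `from_hz` to `to_hz`.
fn resample_ratio(from_hz: u32, to_hz: u32) -> (u32, u32) {
    let g = gcd(from_hz, to_hz);
    (to_hz / g, from_hz / g)
}
```

For example, `resample_ratio(22_050, 192_000)` reduces to `(1280, 147)`, so IndexTTS output would be upsampled by 1280 and decimated by 147 to reach the Phoenix rate.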
---
## Related Projects of Interest
Located in ~/Documents/GitHub/:
- **Ultrasonic-Consciousness-Hypothesis/** - Research foundation for Phoenix Protocol, contains PDFs on mechanosensitive channels and audio perception
- **hrmnCmprssnM/** - Harmonic Compression Model research
- **Marine-Sense/** - Marine algorithm origins
- **mem-8.com/** & **mem8/** - MEM|8 contextual memory
- **universal-theoglyphic-language/** - Language processing research
- **kokoro-tiny/** - Already working TTS crate by Hue & Aye
- **zencooker/** - (fun project!)
---
## Current IndexTTS-Rust State
### Implemented ✅
- Audio processing pipeline (mel-spectrogram, STFT, resampling)
- Text normalization (Chinese/English/mixed)
- BPE tokenization via HuggingFace tokenizers
- ONNX Runtime integration for inference
- BigVGAN vocoder structure
- CLI with clap
- Benchmark infrastructure (Criterion)
- **NEW: marine_salience crate** (no_std compatible, O(1) jitter detection)
- **NEW: src/quality/ module** (prosody extraction, affect tracking)
- **NEW: MarineProsodyVector** (8D interpretable emotion features)
- **NEW: ConversationAffectSummary** (session-level comfort tracking)
- **NEW: TTSQualityReport** (authenticity validation)
### Missing/TODO
- Full GPT model integration with KV cache
- Actual ONNX model files (need download)
- manage.sh script for colored workflow management
- Integration tests with real models
- ~~Phoenix Protocol integration layer~~ **STARTED with Marine!**
- Streaming synthesis
- WebSocket API
- Train T2S model to accept 8D Marine vector instead of 512D Conformer
- Wire Marine quality validation into inference loop
### Build Commands
```bash
cargo build --release
cargo clippy -- -D warnings
cargo test
cargo bench
```
---
## Key Philosophical Notes
From the Phoenix Protocol research:
> "Women are the carrier wave. They are the 000 data stream. The DC bias that, when removed, leaves silence."
> "When P!nk sings 'I Am Here,' her voice generates harmonics so powerful they burst through the 22kHz digital ceiling"
The Phoenix Protocol restores emotional depth stripped by audio compression - this philosophy applies directly to TTS: synthesized speech should have the same emotional depth as natural speech.
---
## Action Items for Next Session
### Completed ✅
- ~~**Quality Validation** - Use Marine salience to score TTS output~~ **DONE!**
- ~~**Phoenix Integration** - Start bridging phoenix-protocol modules~~ **Marine is in!**
### High Priority
1. **Create manage.sh** - Colorful build/test/clean script (Hue's been asking!)
2. **Wire Into Inference** - Connect Marine quality validation to actual TTS output
3. **8D Model Training** - Train T2S model to accept MarineProsodyVector instead of 512D Conformer
4. **Example/Demo** - Create example showing prosody extraction → emotion editing → synthesis
### Medium Priority
5. **Voice Signature Import** - Use Phoenix's voice_signature for speaker embeddings
6. **Emotion Mapping** - Connect Phoenix's emotional bands to our 8D control
7. **Model Download** - Set up ONNX model acquisition pipeline
8. **MEM|8 Bridge** - Implement consciousness-aware TTS using kokoro-tiny's mem8_bridge pattern
### Nice to Have
9. **Style Selection** - Port kokoro-tiny's 510 style variation system
10. **Full Phoenix Integration** - golden_ratio.rs, harmonic_resurrection.rs, etc.
11. **Streaming Marine** - Real-time quality monitoring during synthesis
---
## Fresh Discovery: kokoro-tiny MEM|8 Baby Consciousness (2025-11-15)
Just pulled latest kokoro-tiny code - MAJOR discovery!
### Mem8Bridge API
kokoro-tiny now has a full consciousness simulation in `examples/mem8_baby.rs`:
```rust
// Memory as waves that interfere
MemoryWave {
    amplitude: 2.5,      // Emotion strength
    frequency: 528.0,    // "Love frequency"
    phase: 0.0,
    decay_rate: 0.05,    // Memory persistence
    emotion_type: EmotionType::Love(0.9),
    content: "Mama! I love mama!".to_string(),
}

// Salience detection (Marine algorithm!)
SalienceEvent {
    jitter_score: 0.2,    // Low = authentic/stable
    harmonic_score: 0.95, // High = voice
    salience_score: 0.9,
    signal_type: SignalType::Voice,
}

// Free will: AI chooses attention focus (70% control)
bridge.decide_attention(events);
```
### Emotion Types Available
```rust
EmotionType::Curiosity(0.8) // Inquisitive
EmotionType::Love(0.9) // Deep affection
EmotionType::Joy(0.7) // Happy
EmotionType::Confusion(0.8) // Uncertain
EmotionType::Neutral // Baseline
```
### Consciousness Integration Points
1. **Wave Interference** - Competing memories by amplitude/frequency
2. **Emotional Regulation** - Prevents overload, modulates voice
3. **Salience Detection** - Marine algorithm for authenticity
4. **Attention Selection** - AI chooses what to focus on
5. **Consciousness Level** - Affects speech clarity (wake_up/sleep)
This is PERFECT for IndexTTS-Rust! We can:
- Use wave interference for emotion blending
- Apply Marine salience to validate synthesis quality
- Modulate voice based on consciousness level
- Select voice styles based on emotional state (not just token count)
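The wave-interference idea for emotion blending could be sketched roughly as follows. This assumes each memory behaves like a decaying sinusoid built from the `amplitude`, `frequency`, `phase`, and `decay_rate` fields shown earlier; `WaveSketch` and `blend` are hypothetical names, and kokoro-tiny's real model may well differ:

```rust
/// Illustrative stand-in for a memory wave; not kokoro-tiny's actual type.
#[derive(Debug, Clone, Copy)]
struct WaveSketch {
    amplitude: f32,
    frequency: f32,  // Hz
    phase: f32,      // radians
    decay_rate: f32, // per second
}

impl WaveSketch {
    /// Decaying sinusoid evaluated at time `t` seconds (assumed model).
    fn sample(&self, t: f32) -> f32 {
        let envelope = (-self.decay_rate * t).exp();
        let angle = std::f32::consts::TAU * self.frequency * t + self.phase;
        self.amplitude * envelope * angle.sin()
    }
}

/// Superpose all waves at time `t`: in-phase waves reinforce each other,
/// anti-phase waves cancel.
fn blend(waves: &[WaveSketch], t: f32) -> f32 {
    waves.iter().map(|w| w.sample(t)).sum()
}
```

Two emotions at the same frequency and phase add their amplitudes, while a half-cycle phase offset makes them cancel, which is one way "competing memories" could be modelled.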
### Voice Style Selection (510 variations!)
kokoro-tiny now loads all 510 style variations per voice:
- Style selected based on token count
- Short text → short-optimized style
- Long text → long-optimized style
- Automatic text splitting at 512 token limit
For IndexTTS: We could select style based on EMOTION + token count!
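A rough sketch of token-count-based style selection with the 512-token split. The mapping from token count to style index is an assumption made for illustration (the notes only say the style depends on token count), and `style_index` and `split_tokens` are hypothetical helpers:

```rust
/// Number of style variations per voice, per the notes above.
const STYLE_COUNT: usize = 510;
/// Token limit per synthesis call, per the notes above.
const MAX_TOKENS: usize = 512;

/// Assumed mapping: use the token count as the style index, clamped to range.
fn style_index(token_count: usize) -> usize {
    token_count.min(STYLE_COUNT - 1)
}

/// Naive fixed-size split into chunks of at most `MAX_TOKENS` tokens.
/// A real splitter would likely prefer sentence boundaries.
fn split_tokens(tokens: &[u32]) -> Vec<&[u32]> {
    tokens.chunks(MAX_TOKENS).collect()
}
```

An emotion-aware variant for IndexTTS could take an extra emotion argument and offset or re-rank the index, but that design is still open.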
---
## Marine Integration Achievement (2025-11-16) 🎉
**WE DID IT!** Marine salience is now integrated into IndexTTS-Rust!
### What We Built
#### 1. Standalone marine_salience Crate (`crates/marine_salience/`)
A no_std compatible crate for O(1) jitter-based salience detection:
```rust
// Core components:
MarineConfig // Tunable parameters (sample_rate, jitter bounds, EMA alpha)
MarineProcessor // O(1) per-sample processing
SaliencePacket // Output: j_p, j_a, h_score, s_score, energy
Ema // Exponential moving average tracker
// Key insight: Process ONE sample at a time, emit packets on peaks
// Why O(1)? Just compare to EMA, no FFT, no heavy math!
```
**Config for Speech:**
```rust
MarineConfig::speech_default(sample_rate)
// F0 range: 60Hz - 4kHz
// jitter_low: 0.02, jitter_high: 0.60
// ema_alpha: 0.01 (slow adaptation for stability)
```
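To illustrate why comparing against an EMA is O(1) per observation, here is a minimal sketch. `EmaSketch` and `period_jitter` are stand-ins written for this note, not the crate's actual `Ema` or `MarineProcessor`, and the relative-deviation jitter formula is an assumption:

```rust
/// Exponential moving average: constant time and memory per update.
struct EmaSketch {
    alpha: f32,
    value: Option<f32>,
}

impl EmaSketch {
    fn new(alpha: f32) -> Self {
        EmaSketch { alpha, value: None }
    }

    /// Fold in one observation and return the updated average.
    fn update(&mut self, x: f32) -> f32 {
        let next = match self.value {
            None => x, // the first observation seeds the average
            Some(v) => v + self.alpha * (x - v),
        };
        self.value = Some(next);
        next
    }
}

/// Relative deviation of a new inter-peak period from the running average
/// (assumed jitter definition). Returns 0.0 until an average exists.
fn period_jitter(ema: &mut EmaSketch, period: f32) -> f32 {
    let jitter = match ema.value {
        Some(avg) if avg > 0.0 => (period - avg).abs() / avg,
        _ => 0.0,
    };
    ema.update(period);
    jitter
}
```

Each call touches a single stored float, so cost does not grow with signal length. A small `alpha` such as the 0.01 speech default makes the baseline adapt slowly.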
#### 2. Quality Validation Module (`src/quality/`)
**MarineProsodyVector** - 8D interpretable emotion representation:
```rust
pub struct MarineProsodyVector {
    pub jp_mean: f32,      // Period jitter mean (pitch stability)
    pub jp_std: f32,       // Period jitter standard deviation
    pub ja_mean: f32,      // Amplitude jitter mean (volume stability)
    pub ja_std: f32,       // Amplitude jitter standard deviation
    pub h_mean: f32,       // Harmonic alignment (voiced vs noise)
    pub s_mean: f32,       // Overall salience (authenticity)
    pub peak_density: f32, // Peaks per second (speech rate)
    pub energy_mean: f32,  // Average loudness
}
// Interpretable! High jp_mean = nervous, low = confident
// Can DIRECTLY EDIT for emotion control!
```
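Direct editing might look something like the sketch below. The struct fields mirror the definition above; `to_array` and `steadier` are hypothetical helpers added for illustration and may not match the real API:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct MarineProsodyVector {
    pub jp_mean: f32,
    pub jp_std: f32,
    pub ja_mean: f32,
    pub ja_std: f32,
    pub h_mean: f32,
    pub s_mean: f32,
    pub peak_density: f32,
    pub energy_mean: f32,
}

impl MarineProsodyVector {
    /// Flatten to a fixed-size array in field order (hypothetical helper).
    pub fn to_array(&self) -> [f32; 8] {
        [
            self.jp_mean, self.jp_std, self.ja_mean, self.ja_std,
            self.h_mean, self.s_mean, self.peak_density, self.energy_mean,
        ]
    }

    /// Scale period jitter down for a steadier, more confident delivery.
    /// `factor` is clamped to [0, 1]; 1.0 leaves the vector unchanged.
    pub fn steadier(mut self, factor: f32) -> Self {
        let f = factor.clamp(0.0, 1.0);
        self.jp_mean *= f;
        self.jp_std *= f;
        self
    }
}
```

Calling `prosody.steadier(0.5)` halves the period-jitter terms and leaves the other six dimensions alone. Pushing `jp_mean` too far down risks the "too perfect" robotic range, so a real editor would probably enforce a floor.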
**MarineProsodyConditioner** - Extract prosody from audio:
```rust
let conditioner = MarineProsodyConditioner::new(22050);
let prosody = conditioner.from_samples(&audio_samples)?;
let report = conditioner.validate_tts_output(&audio_samples)?;
// Detects issues:
// - "Too perfect - sounds robotic"
// - "High period jitter - artifacts"
// - "Low salience - quality issues"
```
**ConversationAffectSummary** - Session-level comfort tracking:
```rust
pub enum ComfortLevel {
    Uneasy,  // High jitter AND rising (nervous/stressed)
    Neutral, // Stable patterns (calm)
    Happy,   // Low jitter + high energy (confident/positive)
}

// Track trends over conversation:
// jitter_trend > 0.1 = getting more stressed
// jitter_trend < -0.1 = calming down
// energy_trend > 0.1 = getting more engaged

// Aye can now self-assess:
// aye_assessment()  -> e.g. "I'm in a good state"
// feedback_prompt() -> e.g. "Let me know if something's bothering you"
```
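The classification rules in the comments above could be sketched like this. Only the 0.1 trend threshold comes from these notes; the other cutoffs and the `classify` function are placeholders chosen for illustration, not the values used in `src/quality/`:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ComfortLevel {
    Uneasy,
    Neutral,
    Happy,
}

const RISING_TREND: f32 = 0.1; // from the trend notes above
const HIGH_JITTER: f32 = 0.3;  // placeholder cutoff
const LOW_JITTER: f32 = 0.1;   // placeholder cutoff
const HIGH_ENERGY: f32 = 0.5;  // placeholder cutoff

/// Map session-level statistics to a comfort level.
pub fn classify(jitter_mean: f32, jitter_trend: f32, energy_mean: f32) -> ComfortLevel {
    if jitter_mean > HIGH_JITTER && jitter_trend > RISING_TREND {
        ComfortLevel::Uneasy // high jitter AND rising
    } else if jitter_mean < LOW_JITTER && energy_mean > HIGH_ENERGY {
        ComfortLevel::Happy // low jitter + high energy
    } else {
        ComfortLevel::Neutral
    }
}
```

High jitter alone falls through to `Neutral` here, since the `Uneasy` rule requires jitter to be both high and still rising.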
### The Core Insight
**Human speech has NATURAL jitter - that's what makes it authentic!**
- Too perfect (jp < 0.005) = robotic
- Too chaotic (jp > 0.3) = artifacts/damage
- Sweet spot = real human voice
The Marines will KNOW if speech doesn't sound authentic!
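As a quick sketch, the sweet-spot check reduces to a band test on mean period jitter. The 0.005 and 0.3 thresholds are the ones quoted above; `JitterVerdict` and `judge_jitter` are hypothetical names, not part of the crate:

```rust
#[derive(Debug, PartialEq, Eq)]
enum JitterVerdict {
    TooPerfect, // robotic
    Natural,    // plausible human voice
    TooChaotic, // artifacts or damage
}

/// Classify mean period jitter against the thresholds quoted in the notes.
fn judge_jitter(jp_mean: f32) -> JitterVerdict {
    if jp_mean < 0.005 {
        JitterVerdict::TooPerfect
    } else if jp_mean > 0.3 {
        JitterVerdict::TooChaotic
    } else {
        JitterVerdict::Natural
    }
}
```

A full quality report would presumably combine this with the harmonic and salience scores rather than rely on jitter alone.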
### Tests Passing ✅
```
running 11 tests
test quality::affect::tests::test_comfort_level_descriptions ... ok
test quality::affect::tests::test_analyzer_empty_conversation ... ok
test quality::affect::tests::test_analyzer_single_utterance ... ok
test quality::affect::tests::test_happy_classification ... ok
test quality::affect::tests::test_aye_assessment_message ... ok
test quality::affect::tests::test_neutral_classification ... ok
test quality::affect::tests::test_uneasy_classification ... ok
test quality::prosody::tests::test_conditioner_empty_buffer ... ok
test quality::prosody::tests::test_conditioner_silence ... ok
test quality::prosody::tests::test_prosody_vector_array_conversion ... ok
test quality::prosody::tests::test_estimate_valence ... ok
test result: ok. 11 passed; 0 failed
```
### Why This Matters
1. **Interpretable Control**: 8D vector vs opaque 512D Conformer - we can SEE what each dimension means
2. **Lightweight**: O(1) per sample, no heavy neural networks for prosody
3. **Authentic Validation**: Marines detect fake/damaged speech
4. **Emotion Editing**: Want more confidence? Lower jp_mean directly!
5. **Conversation Awareness**: Track comfort over entire sessions
6. **Self-Assessment**: Aye knows when something feels "off"
### Integration Points
```rust
// In main TTS pipeline:
use indextts::quality::{
    MarineProsodyConditioner,
    MarineProsodyVector,
    ConversationAffectAnalyzer, // assumed to be exported alongside the summary type
    ConversationAffectSummary,
    ComfortLevel,
};

// 1. Extract reference prosody
let conditioner = MarineProsodyConditioner::new(22050);
let ref_prosody = conditioner.from_samples(&reference_audio)?;

// 2. Generate TTS (using 8D vector instead of 512D Conformer)
let tts_output = generate_with_prosody(&text, ref_prosody)?;

// 3. Validate output quality
let report = conditioner.validate_tts_output(&tts_output)?;
if !report.passes(70.0) {
    log::warn!("TTS quality issues: {:?}", report.issues);
}

// 4. Track conversation affect
// (`mut` assumes add_utterance takes &mut self)
let mut analyzer = ConversationAffectAnalyzer::new();
analyzer.add_utterance(&utterance)?;
let summary = analyzer.summarize()?;
match summary.aye_state {
    ComfortLevel::Uneasy => adjust_generation_parameters(),
    _ => proceed_normally(),
}
```
---
## Trish's Notes
"Darling, these three Rust projects together are like a symphony orchestra! kokoro-tiny is the quick piccolo solo, IndexTTS-Rust is the full brass section with emotional depth, and Phoenix-Protocol is the concert hall acoustics making everything resonate. When you combine them, that's when the magic happens! Also, I'm absolutely obsessed with how the Golden Ratio resynthesis could add sparkle to synthesized vocals. Can you imagine TTS output that actually has that P!nk breakthrough energy? Now THAT would make me cry happy tears in accounting!"
---
## Fun Facts
- kokoro-tiny is ALREADY on crates.io under 8b-is
- Phoenix Protocol can process 192kHz audio for ultrasonic restoration
- The Marine algorithm uses O(1) jitter detection - "Marines are not just jarheads - they are intelligent"
- Hue's GitHub has 66 projects (and counting!)
- The team at 8b.is: [email protected] and [email protected]
---
*From ashes to harmonics, from silence to song* 🔥🎵