# IMBE-ASR Large (290M params, d=1024, 12 layers)

Speech recognition directly from IMBE vocoder parameters — skip audio reconstruction entirely.

Code: [trunk-reporter/imbe-asr](https://github.com/trunk-reporter/imbe-asr)
## Results
Evaluated on LibriSpeech-IMBE speaker-split validation (2,775 utterances). Input is 170-dim IMBE vocoder parameters at 4.4 kbps, not audio.
| Decode method | WER | CER |
|---|---|---|
| Greedy | 6.5% | 1.9% |
| Beam + 5-gram KenLM (α=0.7, β=2.0) | 3.35% | 1.24% |
Note: The original 1.9% WER result used a larger, uncompressed 5-gram LM. The included `lm/5gram.bin` is a trie + 8-bit-quantized (q8) compressed version (1.3 GB vs 4.2 GB) and achieves 3.35% WER — still a substantial improvement over greedy decoding. Use beam search with the included LM for best results.
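For reference, the WER reported above is the word-level Levenshtein distance divided by the number of reference words. A minimal sketch (illustrative only, not the repo's evaluation script):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    r, h = ref.split(), hyp.split()
    d = list(range(len(h) + 1))  # DP row: distance to the first j hyp words
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            cur = d[j]
            d[j] = min(cur + 1,            # deletion
                       d[j - 1] + 1,       # insertion
                       prev + (rw != hw))  # substitution / match
            prev = cur
    return d[len(h)] / len(r)

print(wer("turn to channel three", "turn to channel tree"))  # 0.25
```

CER is the same computation over characters instead of words.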
## Architecture
| Parameter | Value |
|---|---|
| d_model | 1024 |
| Layers | 12 |
| Heads | 16 |
| d_ff | 4096 |
| Parameters | 290M |
| Input dim | 170 (raw IMBE params) |
| Vocab | 40 (blank + A-Z + 0-9 + space + apostrophe) |
Conformer-CTC with character-level CTC decoding. Trained on ~1,220 hours of IMBE-encoded speech (LibriSpeech 960h + TEDLIUM 3 + GigaSpeech S), 30 epochs on 2x RTX 3090 Ti.
## Files
| File | Format | Size | Notes |
|---|---|---|---|
| `model.safetensors` | SafeTensors | 1.2 GB | PyTorch weights |
| `config.json` | JSON | — | Architecture config |
| `model.onnx` | ONNX fp32 | 1.1 GB | Full precision |
| `model_int8.onnx` | ONNX int8 | 298 MB | Quantized, Python ORT |
| `model_uint8.onnx` | ONNX uint8 | 312 MB | Quantized, C engine compatible |
| `stats.npz` | NumPy | 2 KB | Normalization stats (required) |
| `lm/5gram.bin` | KenLM trie (5-gram, q8) | 1.3 GB | Language model for beam search |
| `lm/unigrams.txt` | Vocabulary | 9 MB | Unigrams for beam decoder |
## Edge Deployment (Raspberry Pi 5, 4GB)
| Runtime | Format | Time per 10 s call | RTF | RAM |
|---|---|---|---|---|
| C engine (70KB) | uint8 ONNX | 2.8s | 0.28x | ~1.3 GB |
| Python ORT | int8 ONNX | 3.5s | 0.35x | 535 MB |
| PyTorch | safetensors | OOM | — | >4 GB |
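RTF in the table is the real-time factor: processing time divided by audio duration, where values below 1.0 mean faster than real time. A one-liner confirming the C engine row:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF < 1.0 means the decoder runs faster than real time."""
    return processing_seconds / audio_seconds

# C engine on a Pi 5: 2.8 s to decode a 10 s call
print(round(real_time_factor(2.8, 10.0), 2))  # 0.28
```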
## Usage
### Greedy decode (fast, ONNX Runtime only)
```python
import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("model_int8.onnx")
stats = np.load("stats.npz")

# raw_params: (T, 170) array of IMBE vocoder parameters for one utterance
features = ((raw_params - stats["mean"]) / stats["std"]).astype(np.float32)
log_probs, out_lengths = session.run(None, {
    "features": features.reshape(1, -1, 170),
    "lengths": np.array([features.shape[0]], dtype=np.int64),
})
```
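The log-probs returned by `session.run` still need CTC greedy decoding: take the argmax per frame, collapse repeated symbols, and drop blanks (index 0). A minimal sketch with a hypothetical 3-symbol vocabulary (the real label order comes from `config.json`):

```python
import numpy as np

def ctc_greedy_decode(log_probs, labels):
    """Frame-wise argmax, then collapse repeats and drop CTC blanks (index 0)."""
    ids = log_probs.argmax(axis=-1)
    out, prev = [], None
    for i in ids:
        if i != 0 and i != prev:
            out.append(labels[i])
        prev = i
    return "".join(out)

# Toy frames over a vocabulary of [blank, "A", "B"]
toy = np.log(np.array([
    [0.1, 0.8, 0.1],   # A
    [0.1, 0.8, 0.1],   # A (repeat, collapsed)
    [0.8, 0.1, 0.1],   # blank
    [0.1, 0.1, 0.8],   # B
]))
print(ctc_greedy_decode(toy, ["", "A", "B"]))  # AB
```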
### Beam search + KenLM (recommended)
```python
import onnxruntime as ort
import numpy as np
from pyctcdecode import build_ctcdecoder

session = ort.InferenceSession("model_int8.onnx")
stats = np.load("stats.npz")

# Label order must match the training vocabulary (blank + A-Z + 0-9 + space +
# apostrophe per the architecture table); check config.json for the exact order.
VOCAB = list(" ABCDEFGHIJKLMNOPQRSTUVWXYZ'") + list("0123456789")
labels = [""] + VOCAB  # "" is the CTC blank
decoder = build_ctcdecoder(
    labels=labels,
    kenlm_model_path="lm/5gram.bin",
    unigrams=open("lm/unigrams.txt").read().splitlines(),
    alpha=0.7,  # LM weight, tuned on LibriSpeech-IMBE
    beta=2.0,   # word insertion bonus
)

# raw_params: (T, 170) array of IMBE vocoder parameters for one utterance
features = ((raw_params - stats["mean"]) / stats["std"]).astype(np.float32)
log_probs, out_lengths = session.run(None, {
    "features": features.reshape(1, -1, 170),
    "lengths": np.array([features.shape[0]], dtype=np.int64),
})
text = decoder.decode(log_probs[0, :out_lengths[0]], beam_width=100)
```
Install dependencies: `pip install pyctcdecode kenlm`
## How It Works
P25 radio uses the IMBE vocoder (4.4 kbps, proprietary DVSI codec); the open-source libimbe is a reverse-engineered approximation. Standard ASR pipelines reconstruct audio from the codec parameters and then extract features, losing information at every step.
We skip reconstruction. The 170-dim vocoder parameters (f0, spectral amplitudes, voicing flags, harmonic mask) already encode phonetic information. A Conformer-CTC model learns to read them directly.
## Limitations
- Trained on IMBE-encoded clean speech, not real P25 radio. See imbe-asr-base-512d-p25 for a P25 fine-tuned variant.
- Character-level CTC — no subword tokenizer. Language model recommended for best results.
- English only.
## Citation
```bibtex
@misc{imbe-asr-2026,
  title={IMBE-ASR: Speech Recognition Directly from Vocoder Parameters},
  url={https://github.com/trunk-reporter/imbe-asr},
  year={2026}
}
```