IMBE-ASR Large (290M params, d=1024, 12 layers)

Speech recognition directly from IMBE vocoder parameters — skip audio reconstruction entirely.

Code: trunk-reporter/imbe-asr

Results

Evaluated on LibriSpeech-IMBE speaker-split validation (2,775 utterances). Input is 170-dim IMBE vocoder parameters at 4.4 kbps, not audio.

| Decode method | WER | CER |
|---|---|---|
| Greedy | 6.5% | 1.9% |
| Beam + 5-gram KenLM (α=0.7, β=2.0) | 3.35% | 1.24% |

Note: The original 1.9% WER result used a larger uncompressed 5-gram LM. The included lm/5gram.bin is a trie+q8 compressed version (1.3GB vs 4.2GB) and achieves 3.35% WER — still a substantial improvement over greedy. Use beam search with the included LM for best results.
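For context on the metrics: WER is word-level edit distance divided by reference word count, and CER is the same computed over characters. A minimal illustrative sketch (not the project's evaluation script):

```python
# Minimal WER/CER sketch (illustrative only; not the repo's evaluation code)
def edit_distance(ref, hyp):
    # Levenshtein distance with a single rolling row
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def wer(ref, hyp):
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref, hyp):
    return edit_distance(list(ref), list(hyp)) / len(ref)
```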

Architecture

| Parameter | Value |
|---|---|
| d_model | 1024 |
| Layers | 12 |
| Heads | 16 |
| d_ff | 4096 |
| Parameters | 290M |
| Input dim | 170 (raw IMBE params) |
| Vocab | 40 (blank + A-Z + 0-9 + space + apostrophe) |

Conformer-CTC with character-level CTC decoding. Trained on ~1,220 hours of IMBE-encoded speech (LibriSpeech 960h + TEDLIUM 3 + GigaSpeech S), 30 epochs on 2x RTX 3090 Ti.
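The 290M figure is consistent with a back-of-envelope weight count for a standard Conformer block using the table above. This is a sketch under assumptions: a depthwise-conv kernel size of 31 is assumed, and biases, LayerNorms, and the input frontend are ignored.

```python
# Rough Conformer weight count from the architecture table
# (kernel size k=31 is an assumption; biases/norms/frontend ignored)
d, d_ff, n_layers, k = 1024, 4096, 12, 31

attn = 4 * d * d                    # Q, K, V and output projections
ffn  = 2 * (2 * d * d_ff)           # two macaron-style feed-forward modules
conv = 2 * d * d + k * d + d * d    # pointwise (d->2d), depthwise, pointwise (d->d)

total = n_layers * (attn + ffn + conv)  # lands close to the reported 290M
```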

Files

| File | Format | Size | Notes |
|---|---|---|---|
| model.safetensors | SafeTensors | 1.2 GB | PyTorch weights |
| config.json | JSON | — | Architecture config |
| model.onnx | ONNX fp32 | 1.1 GB | Full precision |
| model_int8.onnx | ONNX int8 | 298 MB | Quantized, Python ORT |
| model_uint8.onnx | ONNX uint8 | 312 MB | Quantized, C engine compatible |
| stats.npz | NumPy | 2 KB | Normalization stats (required) |
| lm/5gram.bin | KenLM trie (5-gram, q8) | 1.3 GB | Language model for beam search |
| lm/unigrams.txt | Vocabulary | 9 MB | Unigrams for beam decoder |

Edge Deployment (Raspberry Pi 5, 4GB)

| Runtime | Format | Latency (10 s call) | RTF | RAM |
|---|---|---|---|---|
| C engine (70 KB) | uint8 ONNX | 2.8 s | 0.28x | ~1.3 GB |
| Python ORT | int8 ONNX | 3.5 s | 0.35x | 535 MB |
| PyTorch | safetensors | OOM | — | >4 GB |
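RTF in the table above is the real-time factor: processing time divided by audio duration, so values below 1.0x mean faster than real time.

```python
# Real-time factor as reported in the deployment table
def rtf(processing_s: float, audio_s: float) -> float:
    return processing_s / audio_s

rtf(2.8, 10.0)  # ~0.28, the C-engine row
```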

Usage

Greedy decode (fast, no LM required)

import onnxruntime as ort, numpy as np

session = ort.InferenceSession("model_int8.onnx")
stats = np.load("stats.npz")

# raw_params: (T, 170) array of IMBE vocoder parameters for one utterance
features = ((raw_params - stats["mean"]) / stats["std"]).astype(np.float32)
log_probs, out_lengths = session.run(None, {
    "features": features.reshape(1, -1, 170),
    "lengths": np.array([features.shape[0]], dtype=np.int64),
})

# Greedy CTC decode: argmax per frame, collapse repeats, drop blank (index 0)
labels = [""] + list(" ABCDEFGHIJKLMNOPQRSTUVWXYZ'")
ids = log_probs[0, :out_lengths[0]].argmax(axis=-1)
text = "".join(labels[i] for i, p in zip(ids, np.concatenate(([-1], ids[:-1]))) if i != p and i != 0)

Beam search + KenLM (recommended)

import onnxruntime as ort, numpy as np
from pyctcdecode import build_ctcdecoder

session = ort.InferenceSession("model_int8.onnx")
stats = np.load("stats.npz")

VOCAB = list(" ABCDEFGHIJKLMNOPQRSTUVWXYZ'")
labels = [""] + VOCAB
decoder = build_ctcdecoder(
    labels=labels,
    kenlm_model_path="lm/5gram.bin",
    unigrams=open("lm/unigrams.txt").read().splitlines(),
    alpha=0.7,   # LM weight — tuned on LibriSpeech-IMBE
    beta=2.0,    # word insertion bonus
)

# raw_params: (T, 170) array of IMBE vocoder parameters for one utterance
features = ((raw_params - stats["mean"]) / stats["std"]).astype(np.float32)
log_probs, out_lengths = session.run(None, {
    "features": features.reshape(1, -1, 170),
    "lengths": np.array([features.shape[0]], dtype=np.int64),
})
text = decoder.decode(log_probs[0, :out_lengths[0]], beam_width=100)

Install dependencies: pip install pyctcdecode kenlm

How It Works

P25 radio uses the IMBE vocoder (4.4 kbps, a proprietary DVSI codec); the open-source libimbe is a reverse-engineered approximation. Standard ASR pipelines first reconstruct audio from the codec parameters and then extract features, losing information at each step.

We skip reconstruction. The 170-dim vocoder parameters (f0, spectral amplitudes, voicing flags, harmonic mask) already encode phonetic information. A Conformer-CTC model learns to read them directly.
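A quick consistency check on the input shape, assuming the standard 20 ms IMBE frame and that the 170 dims are per frame:

```python
# IMBE frame arithmetic (20 ms frame length is the standard IMBE value)
bitrate_bps = 4400
frame_ms = 20

bits_per_frame = bitrate_bps * frame_ms // 1000   # 88 coded bits per frame
frames_per_second = 1000 // frame_ms              # 50 parameter vectors per second

# A 10 s transmission therefore becomes a (500, 170) feature matrix
shape_10s = (10 * frames_per_second, 170)
```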

Limitations

  • Trained on IMBE-encoded clean speech, not real P25 radio. See imbe-asr-base-512d-p25 for a P25 fine-tuned variant.
  • Character-level CTC — no subword tokenizer. Language model recommended for best results.
  • English only.

Citation

@misc{imbe-asr-2026,
  title={IMBE-ASR: Speech Recognition Directly from Vocoder Parameters},
  url={https://github.com/trunk-reporter/imbe-asr},
  year={2026}
}