Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing
Paper: arXiv:2604.10708
Unified Audio Understanding, Generation, and Editing (SIGGRAPH 2026)
Audio-Omni is the first end-to-end framework that unifies understanding, generation, and editing across general sound, music, and speech domains. It combines a frozen Multimodal Large Language Model (Qwen2.5-Omni) for high-level reasoning with a trainable Diffusion Transformer for high-fidelity synthesis.
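To make the division of labor concrete, here is a minimal conceptual sketch of the frozen-MLLM-plus-trainable-DiT pattern. This is not the released implementation: class names, dimensions, and the conditioning scheme are illustrative assumptions.

```python
# Conceptual sketch only: a frozen multimodal LLM supplies conditioning
# embeddings, and gradients flow through the diffusion transformer alone.
# All names and dimensions below are illustrative assumptions.
import torch
import torch.nn as nn

class FrozenMLLMEncoder(nn.Module):
    """Stand-in for the frozen Qwen2.5-Omni backbone."""
    def __init__(self, dim=1024):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        for p in self.parameters():
            p.requires_grad = False  # frozen: used only for conditioning

    def forward(self, tokens):
        return self.proj(tokens)

class DiffusionTransformer(nn.Module):
    """Trainable DiT that denoises audio latents under MLLM conditioning."""
    def __init__(self, dim=1024, depth=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, dim)

    def forward(self, noisy_latents, cond):
        # Prepend conditioning tokens, then predict a denoised latent.
        x = torch.cat([cond, noisy_latents], dim=1)
        x = self.blocks(x)
        return self.head(x[:, cond.size(1):])

mllm = FrozenMLLMEncoder()
dit = DiffusionTransformer()
cond = mllm(torch.randn(1, 16, 1024))       # embeddings from prompt/audio
pred = dit(torch.randn(1, 64, 1024), cond)  # one denoising step on latents
print(pred.shape)                           # torch.Size([1, 64, 1024])
```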
```bash
# Clone the GitHub repository
git clone https://github.com/ZeyueT/Audio-Omni.git
cd Audio-Omni

# Install dependencies
pip install -e .
conda install -c conda-forge ffmpeg libsndfile

# Download the model from Hugging Face
huggingface-cli download HKUSTAudio/Audio-Omni --local-dir model/
```
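A quick import check (the package name `audio_omni` comes from the usage example below) verifies the install and confirms torchaudio can see an audio backend:

```python
# Sanity check: the package imports and torchaudio has a usable backend.
import audio_omni  # noqa: F401
import torchaudio

print(torchaudio.list_audio_backends())  # e.g. ['ffmpeg', 'soundfile']
```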
```python
from audio_omni import AudioOmni
import torchaudio

# Load model
model = AudioOmni("model/Audio-Omni.json", "model/model.ckpt")

# 1. Understanding
response = model.understand(
    "Describe the sounds in this audio.",
    audio="example.wav",
)
print(response)

# 2. Generation (Text-to-Audio)
audio = model.generate("T2A", prompt="A clock ticking.")
torchaudio.save("output.wav", audio, model.sample_rate)

# 3. Editing (Add a sound)
audio = model.edit("Add", "input.wav", desc="skateboarding")
torchaudio.save("output_add.wav", audio, model.sample_rate)
```
The download contains:

- `Audio-Omni.json` – Model configuration
- `model.ckpt` – Model checkpoint (~21 GB)
- `synchformer_state_dict.pth` – Synchformer checkpoint for video conditioning (see the sketch below)
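Since the Synchformer checkpoint is described as enabling video conditioning, video-to-audio generation presumably mirrors the T2A call. The task string "V2A" and the `video` keyword below are guesses, not documented usage; check the repository for the actual interface.

```python
# HYPOTHETICAL: "V2A" and video= are assumptions inferred from the T2A
# example above; consult the repository for the real signature.
audio = model.generate("V2A", video="skateboard.mp4")
torchaudio.save("output_v2a.wav", audio, model.sample_rate)
```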
```bash
# Launch interactive demo
python run_gradio.py \
    --model-config model/Audio-Omni.json \
    --ckpt-path model/model.ckpt \
    --server-port 7777
```
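Once the demo is up, its endpoints can also be inspected programmatically with the standard gradio_client package (the port matches the launch command above):

```python
from gradio_client import Client

# Connect to the local demo and print its callable endpoints.
client = Client("http://localhost:7777")
client.view_api()
```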
```bibtex
@article{tian2026audio,
  title={Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing},
  author={Tian, Zeyue and Yang, Binxin and Liu, Zhaoyang and Zhang, Jiexuan and Yuan, Ruibin and Yin, Hubery and Chen, Qifeng and Li, Chen and Lv, Jing and Xue, Wei and others},
  journal={arXiv preprint arXiv:2604.10708},
  year={2026}
}
```
License: CC-BY-NC-4.0 (non-commercial use only). Commercial use of the model weights requires explicit written authorization from the authors. For commercial licensing inquiries, contact ztianad@connect.ust.hk.