arXiv:2509.23610

Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention

Published on Sep 28, 2025 · Submitted by Kai Li on Oct 1, 2025

Abstract

Audio-visual speech separation (AVSS) methods leverage visual cues to extract target speech and have demonstrated strong separation quality in noisy acoustic environments. However, these methods usually involve a large number of parameters and require high computational cost, which is unacceptable in many applications where speech separation serves only as a preprocessing step for further speech processing. To address this issue, we propose an efficient AVSS method named Dolphin. For visual feature extraction, we develop DP-LipCoder, a dual-path lightweight video encoder that transforms lip motion into discrete audio-aligned semantic tokens. For audio separation, we construct a lightweight encoder-decoder separator in which each layer incorporates a global-local attention (GLA) block to efficiently capture multi-scale dependencies. Experiments on three benchmark datasets showed that Dolphin not only surpassed the current state-of-the-art (SOTA) model in separation quality but also achieved remarkable improvements in efficiency: over 50% fewer parameters, more than a 2.4x reduction in MACs, and over 6x faster GPU inference. These results indicate that Dolphin offers a practical and deployable solution for high-performance AVSS in real-world scenarios. Our code and demo page are publicly available at http://cslikai.cn/Dolphin/.

AI-generated summary

Dolphin, an efficient AVSS method, uses a dual-path lightweight video encoder and a lightweight encoder-decoder separator with global-local attention blocks to achieve high separation quality and significant computational efficiency.

Community

Paper submitter

🐬 Dolphin: Efficient Audio-Visual Speech Separation

Author's Introduction

Hi everyone! 👋 We're excited to share Dolphin - our work on making audio-visual speech separation actually practical for real-world deployment.

🎯 What We Built

Dolphin separates target speech from noisy audio by leveraging lip movements. The key innovation: achieving SOTA quality while being 6× faster and using 50% fewer parameters than previous methods.
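
For a quick mental model of the data flow, here is a minimal sketch of the interface, assuming a PyTorch-style setup; the class and argument names below are placeholders for illustration, not Dolphin's released API.

```python
# Hypothetical sketch of the AVSS data flow (not the released API):
# a video encoder turns lip frames into visual cues, and a separator uses those
# cues to pull the target voice out of the noisy mixture.
import torch
import torch.nn as nn

class AVSSPipeline(nn.Module):
    def __init__(self, video_encoder: nn.Module, separator: nn.Module):
        super().__init__()
        self.video_encoder = video_encoder  # lip-motion encoder (DP-LipCoder in the paper)
        self.separator = separator          # encoder-decoder audio separator (GLA blocks in the paper)

    def forward(self, mixture: torch.Tensor, lip_video: torch.Tensor) -> torch.Tensor:
        # mixture:   (batch, samples)          noisy single-channel waveform
        # lip_video: (batch, frames, 1, H, W)  cropped mouth-region frames
        visual_cues = self.video_encoder(lip_video)           # (batch, frames, dim)
        target_speech = self.separator(mixture, visual_cues)  # (batch, samples)
        return target_speech
```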

Two main contributions:

  1. DP-LipCoder: A lightweight video encoder that uses vector quantization to extract discrete lip semantics. We distill knowledge from AV-HuBERT while keeping the model compact (see the quantization sketch after this list).

  2. Global-Local Attention: Multi-scale attention blocks that capture both long-range context (global attention) and fine-grained detail (a local heat-diffusion branch) in a single pass - no iterative refinement needed! (A hedged sketch of the global-local pattern also follows below.)
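
To make item 1 concrete, here is a minimal sketch of the vector-quantization step behind the discrete lip tokens, assuming a standard VQ-VAE-style codebook with a straight-through gradient estimator. The codebook size, feature dimension, and class name are illustrative defaults rather than the paper's configuration, and the AV-HuBERT distillation loss is omitted.

```python
# Illustrative VQ layer for turning continuous lip features into discrete tokens.
# Standard nearest-codeword lookup with a straight-through estimator; sizes and
# names are placeholders, not Dolphin's released configuration.
import torch
import torch.nn as nn

class LipVectorQuantizer(nn.Module):
    def __init__(self, codebook_size: int = 1024, dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)
        nn.init.uniform_(self.codebook.weight, -1.0 / codebook_size, 1.0 / codebook_size)

    def forward(self, z: torch.Tensor):
        # z: (batch, frames, dim) continuous lip-motion features from the video encoder
        flat = z.reshape(-1, z.shape[-1])                # (B*T, dim)
        dists = torch.cdist(flat, self.codebook.weight)  # distance to every codeword
        tokens = dists.argmin(dim=-1)                    # discrete lip-semantic token ids
        quantized = self.codebook(tokens).view_as(z)
        # Straight-through: gradients flow to the encoder as if quantization were identity.
        quantized = z + (quantized - z).detach()
        return quantized, tokens.view(z.shape[:-1])
```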

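And for item 2, a hedged sketch of the global-local pattern: a global branch attends over a downsampled sequence for long-range context while a local branch refines fine-grained detail, and the two are fused in a single pass. The operators below (multi-head attention plus a depthwise convolution) are common stand-ins; Dolphin's actual GLA block, including its heat-diffusion-style local operator, differs.

```python
# Illustrative global-local block: global self-attention on a pooled sequence
# plus a local depthwise-conv branch, fused back together. A stand-in sketch only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalLocalBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4, pool: int = 4, local_kernel: int = 5):
        super().__init__()
        self.pool = pool
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.local_conv = nn.Conv1d(dim, dim, local_kernel, padding=local_kernel // 2, groups=dim)
        self.fuse = nn.Linear(2 * dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) audio feature sequence
        # Global branch: attend over a shorter, pooled sequence, then upsample back.
        pooled = F.avg_pool1d(x.transpose(1, 2), self.pool).transpose(1, 2)
        g, _ = self.global_attn(pooled, pooled, pooled)
        g = F.interpolate(g.transpose(1, 2), size=x.shape[1]).transpose(1, 2)
        # Local branch: depthwise convolution captures fine-grained neighborhood detail.
        l = self.local_conv(x.transpose(1, 2)).transpose(1, 2)
        return self.norm(x + self.fuse(torch.cat([g, l], dim=-1)))
```

Per the abstract, each layer of Dolphin's encoder-decoder separator incorporates such a GLA block, which is how the model captures multi-scale dependencies without iterative refinement.
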
📊 Results Snapshot

On VoxCeleb2:

  • ✅ 16.1 dB SI-SNRi (vs IIANet's 15.8 dB)
  • ✅ 51M params (vs 112M) - 54% reduction
  • ✅ 417G MACs (vs 1009G) - 59% less computation
  • ✅ 0.015s inference (vs 0.100s) - 6.8× speedup

🚀 Try It Out

Thanks to the HF team for featuring our work! Feel free to ask questions - we're here to discuss. 🙌

Paper: arXiv:2509.23610

