# MuQ & MuQ-MuLan
This is the official repository for the paper *"**MuQ**: Self-Supervised **Mu**sic Representation Learning
with Mel Residual Vector **Q**uantization"*.
In this repo, the following models are released:
- **MuQ**: A large music foundation model pre-trained via Self-Supervised Learning (SSL), achieving SOTA in various MIR tasks.
- **MuQ-MuLan**: A music-text joint embedding model trained via contrastive learning, supporting both English and Chinese texts.
## Overview
We develop **MuQ** for music SSL. MuQ uses our proposed Mel-RVQ to produce quantization targets and achieves SOTA performance on many music understanding (MIR) tasks.
We also construct **MuQ-MuLan**, a CLIP-like model trained with contrastive learning, which jointly embeds music and text.
For more details, please refer to our [paper](https://arxiv.org/abs/2501.01108).
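To give some intuition for the residual quantization idea, here is a minimal toy sketch of plain RVQ (not the paper's Mel-RVQ design, and the codebooks here are random rather than learned): each stage snaps the current residual to its nearest codebook entry, so a vector is described by one discrete code per stage.

```python
import torch

def rvq_encode(x, codebooks):
    """Toy residual vector quantization: each stage quantizes the
    residual left over by the previous stage with its own codebook."""
    residual = x
    codes, quantized = [], torch.zeros_like(x)
    for cb in codebooks:                                   # cb: (num_codes, dim)
        dists = ((residual[None, :] - cb) ** 2).sum(dim=-1)  # (num_codes,)
        idx = int(dists.argmin())                          # nearest code
        codes.append(idx)
        quantized = quantized + cb[idx]
        residual = residual - cb[idx]
    return codes, quantized

torch.manual_seed(0)
codebooks = [torch.randn(256, 8) for _ in range(4)]  # 4 stages, 256 codes each
x = torch.randn(8)
codes, x_hat = rvq_encode(x, codebooks)
print(codes)                       # one discrete code per stage
print((x - x_hat).norm().item())   # remaining reconstruction error
```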
## Usage
To begin, install the official `muq` library via pip (requires `python>=3.8`):
```bash
pip3 install muq
```
To extract music audio features with **MuQ**, use the following code:
```python
import torch, librosa
from muq import MuQ
device = 'cuda'
wav, sr = librosa.load("path/to/music_audio.wav", sr=24000)
wavs = torch.tensor(wav).unsqueeze(0).to(device)
# This will automatically fetch the checkpoint from huggingface
muq = MuQ.from_pretrained("OpenMuQ/MuQ-large-msd-iter")
muq = muq.to(device).eval()
with torch.no_grad():
    output = muq(wavs, output_hidden_states=True)
print('Total number of layers: ', len(output.hidden_states))
print('Feature shape: ', output.last_hidden_state.shape)
```
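Downstream MIR probes often use an intermediate hidden layer rather than the final one. Continuing from the snippet above, a clip-level feature can be obtained by picking one layer and mean-pooling over time; the layer index here is an arbitrary illustration, not a recommendation from the paper.

```python
# Continuing from the example above: select one hidden layer and
# mean-pool over time to get a clip-level feature for a downstream
# classifier. Layer 6 is an arbitrary choice; the best layer is
# task-dependent.
layer_features = output.hidden_states[6]     # (batch, time, dim)
clip_embedding = layer_features.mean(dim=1)  # (batch, dim)
print('Clip-level embedding shape:', clip_embedding.shape)
```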
To extract music and text embeddings with **MuQ-MuLan** and compute their similarity:
```python
import torch, librosa
from muq import MuQMuLan
# This will automatically fetch checkpoints from huggingface
device = 'cuda'
mulan = MuQMuLan.from_pretrained("OpenMuQ/MuQ-MuLan-large")
mulan = mulan.to(device).eval()
# Extract music embeddings
wav, sr = librosa.load("path/to/music_audio.wav", sr=24000)
wavs = torch.tensor(wav).unsqueeze(0).to(device)
with torch.no_grad():
    audio_embeds = mulan(wavs=wavs)
# Extract text embeddings (texts can be in English or Chinese)
texts = ["classical genres, hopeful mood, piano.", "一首适合海边风景的小提琴曲,节奏欢快"]
with torch.no_grad():
    text_embeds = mulan(texts=texts)
# Calculate dot product similarity
sim = mulan.calc_similarity(audio_embeds, text_embeds)
print(sim)
```
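The similarity matrix can also drive simple zero-shot tagging or retrieval. The sketch below continues from the snippet above and assumes `sim` has shape `(n_audio, n_texts)`, consistent with the dot-product similarity it computes.

```python
# Continuing from the example above: rank the candidate descriptions
# for each audio clip (assumes sim is an (n_audio, n_texts) matrix).
best = sim.argmax(dim=-1)
for i, j in enumerate(best.tolist()):
    print(f"Audio {i} best matches: {texts[j]}")
```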
> Note that both MuQ and MuQ-MuLan strictly require **24 kHz** audio as input.
> We recommend using **fp32** during MuQ inference to avoid potential NaN issues.
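If your source audio has a different sample rate, `librosa.load` can resample on the fly; it also returns `float32` by default, which satisfies the fp32 recommendation. A minimal loading sketch:

```python
import librosa
import torch

# librosa resamples to the requested rate on load and returns float32
# by default, satisfying both requirements above.
wav, sr = librosa.load("path/to/any_audio.mp3", sr=24000, mono=True)
assert sr == 24000
wavs = torch.from_numpy(wav).unsqueeze(0)  # (1, num_samples), dtype float32
```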
## Model Checkpoints
| Model Name | Parameters | Data | HuggingFace🤗 |
| ----------- | --- | --- | ----------- |
| MuQ | ~300M | MSD dataset | [OpenMuQ/MuQ-large-msd-iter](https://huggingface.co/OpenMuQ/MuQ-large-msd-iter) |
| MuQ-MuLan | ~700M | music-text pairs | [OpenMuQ/MuQ-MuLan-large](https://huggingface.co/OpenMuQ/MuQ-MuLan-large) |
**Note**: The open-sourced MuQ was trained on the Million Song Dataset. Due to differences in dataset size, the open-sourced model may not achieve the same level of performance as reported in the paper. The training recipes can be found [here](./src/recipes).
## License
The code in this repository is released under the MIT license as found in the [LICENSE](LICENSE) file.
The model weights (MuQ-large-msd-iter, MuQ-MuLan-large) in this repository are released under the CC-BY-NC 4.0 license, as detailed in the [LICENSE_weights](LICENSE_weights) file.
## Citation
```bibtex
@article{zhu2025muq,
title={MuQ: Self-Supervised Music Representation Learning with Mel Residual Vector Quantization},
author={Haina Zhu and Yizhi Zhou and Hangting Chen and Jianwei Yu and Ziyang Ma and Rongzhi Gu and Yi Luo and Wei Tan and Xie Chen},
journal={arXiv preprint arXiv:2501.01108},
year={2025}
}
```
## Acknowledgement
Our code borrows from the following repositories:
- [lucidrains/musiclm-pytorch](https://github.com/lucidrains/musiclm-pytorch)
- [minzwon/musicfm](https://github.com/minzwon/musicfm)
We are also especially grateful to the awesome [MARBLE-Benchmark](https://github.com/a43992899/MARBLE-Benchmark).