# RzenEmbed-v2-7B

RzenEmbed-v2-7B is a multimodal embedding model developed and open-sourced by 360CVGroup. It achieves state-of-the-art (SOTA) results on the MMEB-V2, MMEB-Visdoc, and MMEB-Video benchmarks (as of September 29, 2025).

[![arXiv](https://img.shields.io/badge/arXiv-2510.27350-b31b1b.svg)](https://arxiv.org/abs/2510.27350)
[![GitHub](https://img.shields.io/badge/GitHub-Repository-blue?logo=github)](https://github.com/360CVGroup/RzenEmbed)
[![Benchmark](https://img.shields.io/badge/MMEB-Benchmark-blue.svg)](https://huggingface.co/spaces/TIGER-Lab/MMEB-Leaderboard)

### MMEB-V2

| Model | Model Size (B) | Overall | Image-Overall | Video-Overall | Visdoc-Overall |
| ------------------------ | -------------- | --------- | ------------- | ------------- | -------------- |
| RzenEmbed-v2-7B | 8.29 | **71.61** | 75.92 | **55.73** | **77.06** |
| seed-1.6-embedding | unknown | 71.27 | **77.78** | 55.34 | 73.44 |
| Ops-MM-embedding-v1-7B | 8.29 | 67.61 | 72.72 | 53.76 | 70.34 |
| Ops-MM-embedding-v1-2B | 2.21 | 63.44 | 69.03 | 47.56 | 66.96 |
| interestFM-UIR-CAFe-7B | 8.03 | 60.63 | 67.56 | 42.4 | 63.92 |
| VLM2Vec-V2.0-Qwen2VL-2B | 2.21 | 58.02 | 64.85 | 34.85 | 65.36 |
| gme-Qwen2-VL-7B-Instruct | 8.29 | 57.83 | 55.95 | 38.43 | 75.18 |
| gme-Qwen2-VL-2B-Instruct | 2.21 | 54.08 | 51.89 | 33.64 | 72.71 |

### MMEB-Image

| Model | Model Size (B) | Image-Overall | I-CLS | I-QA | I-RET | I-VG |
| ---------------------- | -------------- | ------------- | --------- | --------- | -------- | -------- |
| seed-1.6-embedding | unknown | **77.78** | **76.06** | **73.97** | 77.9 | 91.25 |
| RzenEmbed-v2-7B | 8.29 | 75.92 | 70.61 | 71.67 | **78.5** | **92.1** |
| QQMM-embed-v2 | 8.29 | 75.28 | 72.97 | 71.85 | 76.01 | 87.42 |
| ReCo-7B | 8.29 | 73.87 | 70.95 | 71.52 | 73.66 | 87.70 |
| OEmbedding-v1-7B | 8.29 | 72.79 | 70.05 | 68.1 | 73.84 | 88.25 |
| Ops-MM-embedding-v1-7B | 8.29 | 72.72 | 69.65 | 69.58 | 73.09 | 87.15 |
| QQMM-embed | 8.29 | 72.18 | 70.07 | 69.52 | 71.18 | 87.08 |
| B3_Qwen2_7B | 8.29 | 72.00 | 70.00 | 66.50 | 74.10 | 84.60 |

### MMEB-Video

| Model | Model Size (B) | Video-Overall | V-CLS | V-QA | V-RET | V-MRET |
| ------------------------ | -------------- | ------------- | --------- | -------- | --------- | --------- |
| RzenEmbed-v2-7B | 8.29 | **55.73** | 58.82 | **63.5** | 50.97 | 45.54 |
| seed-1.6-embedding | unknown | 55.34 | 54.99 | 60.85 | **51.33** | **53.45** |
| Ops-MM-embedding-v1-7B | 8.29 | 53.76 | **59.68** | 62.22 | 45.72 | 43.21 |
| interestFM-UIR-CAFe-7B | 8.03 | 42.40 | 35.81 | 58.66 | 34.44 | 39.53 |
| gme-Qwen2-VL-7B-Instruct | 8.29 | 38.43 | 37.44 | 50.35 | 28.37 | 36.96 |
| interestFM-UIR-CAFe-0.5B | 0.89 | 35.87 | 33.90 | 41.72 | 29.69 | 39.69 |
| LamRA-Ret | 8.29 | 34.96 | 39.27 | 42.6 | 24.26 | 32.84 |
| VLM2Vec-V2.0-Qwen2VL-2B | 2.21 | 34.58 | 39.30 | 34.32 | 28.77 | 36.82 |

### MMEB-Visdoc

| Model | Model Size (B) | Visdoc-Overall | ViDoRe-V1 | ViDoRe-V2 | VisRAG | VisDoc-OOD |
| ------------------------ | -------------- | -------------- | --------- | --------- | -------- | ---------- |
| RzenEmbed-v2-7B | 8.29 | **77.06** | **89.7** | **60.7** | **88.7** | 44.38 |
| gme-Qwen2-VL-7B-Instruct | 8.29 | 75.18 | 89.44 | 55.61 | 84.99 | **44.4** |
| seed-1.6-embedding | unknown | 73.44 | 85.53 | 56.57 | 84.74 | 43.14 |
| gme-Qwen2-VL-2B-Instruct | 2.21 | 72.71 | 86.15 | 53.96 | 82.52 | 43.12 |
| colpali-v1.3 | 2.92 | 70.97 | 83.60 | 51.98 | 81.13 | 43.12 |
| Ops-MM-embedding-v1-7B | 8.29 | 70.34 | 80.05 | 59.59 | 79.32 | 43.34 |
| Ops-MM-embedding-v1-2B | 2.21 | 66.96 | 76.39 | 53.18 | 77.64 | 41.17 |
| VLM2Vec-V2.0-Qwen2VL-2B | 2.21 | 65.36 | 75.52 | 44.86 | 79.38 | 39.43 |

## Usage

### Text-to-Image Retrieval

Retrieve images that match given text captions.

```python
from rzen_embed_inference import RzenEmbed

rzen = RzenEmbed("qihoo360/RzenEmbed")

queries = [
    "A curious kitten and a gentle puppy share a moment of connection on the grass.",
    "Fresh fridge full of berries yogurt milk and snacks."
]
candidates = [
    "assets/example1.jpg",
    "assets/example2.jpg",
]

query_instruction = "Find me an everyday image that matches the given caption: "
candidate_instruction = "Represent the given image."

# Generate embeddings for both modalities
query_embeds = rzen.get_fused_embeddings(instruction=query_instruction, texts=queries)
candidate_embeds = rzen.get_fused_embeddings(instruction=candidate_instruction, images=candidates)

# Calculate text-to-image similarity scores
similarity_scores = query_embeds @ candidate_embeds.T
print(similarity_scores)
```
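Each row of `similarity_scores` scores one text query against every candidate image, so the best match per query is a row-wise argmax. A minimal sketch of turning the matrix into ranked results, assuming the embeddings come back as 2-D `torch.Tensor`s (which the `@` product above suggests, but which this snippet alone does not guarantee):

```python
# Hypothetical follow-up to the example above: pick the best-scoring
# candidate image for each text query. Assumes `similarity_scores` is a
# (num_queries, num_candidates) torch.Tensor and reuses `queries`/`candidates`.
best = similarity_scores.argmax(dim=-1)
for q_idx, c_idx in enumerate(best.tolist()):
    score = similarity_scores[q_idx, c_idx].item()
    print(f"{queries[q_idx]!r} -> {candidates[c_idx]} (score={score:.4f})")
```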
### Image-to-Text Retrieval

Find the text captions that best match given images.

```python
from rzen_embed_inference import RzenEmbed

rzen = RzenEmbed("qihoo360/RzenEmbed")

queries = [
    "assets/example1.jpg",
    "assets/example2.jpg",
]
candidates = [
    "A curious kitten and a gentle puppy share a moment of connection on the grass.",
    "Fresh fridge full of berries yogurt milk and snacks."
]

query_instruction = "Find an image caption describing the given everyday image."

# Generate embeddings for both modalities
query_embeds = rzen.get_fused_embeddings(instruction=query_instruction, images=queries)
candidate_embeds = rzen.get_fused_embeddings(texts=candidates)

# Calculate image-to-text similarity scores
similarity_scores = query_embeds @ candidate_embeds.T
print(similarity_scores)
```

### Document Retrieval

Match text queries with document images for information retrieval.

```python
from rzen_embed_inference import RzenEmbed

rzen = RzenEmbed("qihoo360/RzenEmbed")

queries = [
    "What is the main variable being analyzed on the x-axis of these graphs?",
    "What are the personnel costs in the 4th year?"
]
candidates = [
    "assets/example3.jpg",
    "assets/example4.jpg",
]

query_instruction = "Find a document image that matches the given query: "
candidate_instruction = "Understand the content of the provided document image."

# Generate embeddings for document retrieval
query_embeds = rzen.get_fused_embeddings(instruction=query_instruction, texts=queries)
candidate_embeds = rzen.get_fused_embeddings(instruction=candidate_instruction, images=candidates)

# Calculate text-to-document similarity scores
similarity_scores = query_embeds @ candidate_embeds.T
print(similarity_scores)
```
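With a document collection larger than a couple of pages, you typically want the top-k pages per query rather than the raw score matrix. A minimal sketch using `torch.topk`, under the same (unverified) assumption that the embeddings are returned as torch tensors:

```python
import torch

# Hypothetical top-k lookup over the candidate document images above.
# Assumes `query_embeds` and `candidate_embeds` are 2-D torch.Tensors.
k = min(2, candidate_embeds.shape[0])
scores = query_embeds @ candidate_embeds.T  # (num_queries, num_candidates)
top_scores, top_idx = torch.topk(scores, k=k, dim=-1)
for q_idx, (idxs, vals) in enumerate(zip(top_idx.tolist(), top_scores.tolist())):
    hits = ", ".join(f"{candidates[i]} ({v:.4f})" for i, v in zip(idxs, vals))
    print(f"query {q_idx}: {hits}")
```

If the embeddings are L2-normalized, these scores are cosine similarities and thresholds transfer reasonably well across queries; that normalization is also an assumption to check against the upstream inference code.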
### Video Retrieval

Retrieve videos based on text captions.

```python
import cv2
import numpy as np
from PIL import Image

from rzen_embed_inference import RzenEmbed


def extract_frames(video_path, num_frames):
    """Uniformly sample `num_frames` frames from a video and return them as PIL images."""
    cap = cv2.VideoCapture(video_path)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frame_indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)
    frames = []
    for idx in frame_indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ret, frame = cap.read()
        if ret:
            # OpenCV decodes frames as BGR; convert to RGB before wrapping in PIL.
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        else:
            break
    cap.release()
    return frames


rzen = RzenEmbed("qihoo360/RzenEmbed")

queries = [
    "A traditional boat glides along a river lined with blooming cherry blossoms under an overcast sky in a modern cityscape.",
    "Tiny ginger kitten meows cutely by the water."
]

# Extract frames from the candidate videos
video_path_list = [
    "assets/example5.mp4",
    "assets/example6.mp4",
]
candidates = [extract_frames(video_path, num_frames=8) for video_path in video_path_list]

query_instruction = "Find the video snippet that corresponds to the given caption: "
candidate_instruction = "Understand the content of the provided video."

# Generate embeddings for video retrieval
query_embeds = rzen.get_fused_embeddings(instruction=query_instruction, texts=queries)
candidate_embeds = rzen.get_fused_embeddings(instruction=candidate_instruction, images=candidates)

# Calculate text-to-video similarity scores
similarity_scores = query_embeds @ candidate_embeds.T
print(similarity_scores)
```

## Citation

If you find RzenEmbed useful for your research and applications, please cite it using this BibTeX:

```bibtex
@article{jian2025rzenembed,
  title={RzenEmbed: Towards Comprehensive Multimodal Retrieval},
  author={Jian, Weijian and Zhang, Yajun and Liang, Dawei and Xie, Chunyu and He, Yixiao and Leng, Dawei and Yin, Yuhui},
  journal={arXiv preprint arXiv:2510.27350},
  year={2025}
}
```