# RzenEmbed-v2-7B

RzenEmbed-v2-7B is a multimodal embedding model developed and open-sourced by 360CVGroup. It achieves state-of-the-art (SOTA) results on the MMEB-V2, MMEB-Visdoc, and MMEB-Video benchmarks (as of September 29, 2025).

[![arXiv](https://img.shields.io/badge/arXiv-2510.27350-b31b1b.svg)](https://arxiv.org/abs/2510.27350)
[![GitHub](https://img.shields.io/badge/GitHub-Repository-blue?logo=github)](https://github.com/360CVGroup/RzenEmbed)
[![Benchmark](https://img.shields.io/badge/MMEB-Benchmark-blue.svg)](https://huggingface.co/spaces/TIGER-Lab/MMEB-Leaderboard)

### MMEB-V2

| Model | Model Size (B) | Overall | Image-Overall | Video-Overall | Visdoc-Overall |
| ------------------------ | -------------- | --------- | ------------- | ------------- | -------------- |
| RzenEmbed-v2-7B | 8.29 | **71.61** | 75.92 | **55.73** | **77.06** |
| seed-1.6-embedding | unknown | 71.27 | **77.78** | 55.34 | 73.44 |
| Ops-MM-embedding-v1-7B | 8.29 | 67.61 | 72.72 | 53.76 | 70.34 |
| Ops-MM-embedding-v1-2B | 2.21 | 63.44 | 69.03 | 47.56 | 66.96 |
| interestFM-UIR-CAFe-7B | 8.03 | 60.63 | 67.56 | 42.4 | 63.92 |
| VLM2Vec-V2.0-Qwen2VL-2B | 2.21 | 58.02 | 64.85 | 34.85 | 65.36 |
| gme-Qwen2-VL-7B-Instruct | 8.29 | 57.83 | 55.95 | 38.43 | 75.18 |
| gme-Qwen2-VL-2B-Instruct | 2.21 | 54.08 | 51.89 | 33.64 | 72.71 |

### MMEB-Image

| Model | Model Size (B) | Image-Overall | I-CLS | I-QA | I-RET | I-VG |
| ---------------------- | -------------- | ------------- | --------- | --------- | -------- | -------- |
| seed-1.6-embedding | unknown | **77.78** | **76.06** | **73.97** | 77.9 | 91.25 |
| RzenEmbed-v2-7B | 8.29 | 75.92 | 70.61 | 71.67 | **78.5** | **92.1** |
| QQMM-embed-v2 | 8.29 | 75.28 | 72.97 | 71.85 | 76.01 | 87.42 |
| ReCo-7B | 8.29 | 73.87 | 70.95 | 71.52 | 73.66 | 87.70 |
| OEmbedding-v1-7B | 8.29 | 72.79 | 70.05 | 68.1 | 73.84 | 88.25 |
| Ops-MM-embedding-v1-7B | 8.29 | 72.72 | 69.65 | 69.58 | 73.09 | 87.15 |
| QQMM-embed | 8.29 | 72.18 | 70.07 | 69.52 | 71.18 | 87.08 |
| B3_Qwen2_7B | 8.29 | 72.00 | 70.00 | 66.50 | 74.10 | 84.60 |

### MMEB-Video

| Model | Model Size (B) | Video-Overall | V-CLS | V-QA | V-RET | V-MRET |
| ------------------------ | -------------- | ------------- | --------- | -------- | --------- | --------- |
| RzenEmbed-v2-7B | 8.29 | **55.73** | 58.82 | **63.5** | 50.97 | 45.54 |
| seed-1.6-embedding | unknown | 55.34 | 54.99 | 60.85 | **51.33** | **53.45** |
| Ops-MM-embedding-v1-7B | 8.29 | 53.76 | **59.68** | 62.22 | 45.72 | 43.21 |
| interestFM-UIR-CAFe-7B | 8.03 | 42.40 | 35.81 | 58.66 | 34.44 | 39.53 |
| gme-Qwen2-VL-7B-Instruct | 8.29 | 38.43 | 37.44 | 50.35 | 28.37 | 36.96 |
| interestFM-UIR-CAFe-0.5B | 0.89 | 35.87 | 33.90 | 41.72 | 29.69 | 39.69 |
| LamRA-Ret | 8.29 | 34.96 | 39.27 | 42.6 | 24.26 | 32.84 |
| VLM2Vec-V2.0-Qwen2VL-2B | 2.21 | 34.58 | 39.30 | 34.32 | 28.77 | 36.82 |

### MMEB-Visdoc

| Model | Model Size (B) | Visdoc-Overall | ViDoRe-V1 | ViDoRe-V2 | VisRAG | VisDoc-OOD |
| ------------------------ | -------------- | -------------- | --------- | --------- | -------- | ---------- |
| RzenEmbed-v2-7B | 8.29 | **77.06** | **89.7** | **60.7** | **88.7** | 44.38 |
| gme-Qwen2-VL-7B-Instruct | 8.29 | 75.18 | 89.44 | 55.61 | 84.99 | **44.4** |
| seed-1.6-embedding | unknown | 73.44 | 85.53 | 56.57 | 84.74 | 43.14 |
| gme-Qwen2-VL-2B-Instruct | 2.21 | 72.71 | 86.15 | 53.96 | 82.52 | 43.12 |
| colpali-v1.3 | 2.92 | 70.97 | 83.60 | 51.98 | 81.13 | 43.12 |
| Ops-MM-embedding-v1-7B | 8.29 | 70.34 | 80.05 | 59.59 | 79.32 | 43.34 |
| Ops-MM-embedding-v1-2B | 2.21 | 66.96 | 76.39 | 53.18 | 77.64 | 41.17 |
| VLM2Vec-V2.0-Qwen2VL-2B | 2.21 | 65.36 | 75.52 | 44.86 | 79.38 | 39.43 |

## Usage

### Text-to-Image Retrieval

Retrieve images that match given text captions.

```python
from rzen_embed_inference import RzenEmbed

rzen = RzenEmbed("qihoo360/RzenEmbed")

queries = [
    "A curious kitten and a gentle puppy share a moment of connection on the grass.",
    "Fresh fridge full of berries yogurt milk and snacks."
]
candidates = [
    "assets/example1.jpg",
    "assets/example2.jpg",
]

query_instruction = "Find me an everyday image that matches the given caption: "
candidate_instruction = "Represent the given image."

# Generate embeddings for both modalities
query_embeds = rzen.get_fused_embeddings(instruction=query_instruction, texts=queries)
candidate_embeds = rzen.get_fused_embeddings(instruction=candidate_instruction, images=candidates)

# Calculate text-to-image similarity scores
similarity_scores = query_embeds @ candidate_embeds.T
print(similarity_scores)
```
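Each row of `similarity_scores` scores one text query against every candidate image, so the best match per query is a row-wise argmax. A minimal sketch of turning the matrix into ranked results, assuming the embeddings come back as 2-D `torch.Tensor`s (which the `@` product above suggests, but which this snippet alone does not guarantee):

```python
# Hypothetical follow-up to the example above: pick the best-scoring
# candidate image for each text query. Assumes `similarity_scores` is a
# (num_queries, num_candidates) torch.Tensor and reuses `queries`/`candidates`.
best = similarity_scores.argmax(dim=-1)
for q_idx, c_idx in enumerate(best.tolist()):
    score = similarity_scores[q_idx, c_idx].item()
    print(f"{queries[q_idx]!r} -> {candidates[c_idx]} (score={score:.4f})")
```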
### Image-to-Text Retrieval

Find the text captions that best match given images.

```python
from rzen_embed_inference import RzenEmbed

rzen = RzenEmbed("qihoo360/RzenEmbed")

queries = [
    "assets/example1.jpg",
    "assets/example2.jpg",
]
candidates = [
    "A curious kitten and a gentle puppy share a moment of connection on the grass.",
    "Fresh fridge full of berries yogurt milk and snacks."
]

query_instruction = "Find an image caption describing the given everyday image."

# Generate embeddings for both modalities
query_embeds = rzen.get_fused_embeddings(instruction=query_instruction, images=queries)
candidate_embeds = rzen.get_fused_embeddings(texts=candidates)

# Calculate image-to-text similarity scores
similarity_scores = query_embeds @ candidate_embeds.T
print(similarity_scores)
```

### Document Retrieval

Match text queries with document images for information retrieval.

```python
from rzen_embed_inference import RzenEmbed

rzen = RzenEmbed("qihoo360/RzenEmbed")

queries = [
    "What is the main variable being analyzed on the x-axis of these graphs?",
    "What are the personnel costs in the 4th year?"
]
candidates = [
    "assets/example3.jpg",
    "assets/example4.jpg",
]

query_instruction = "Find a document image that matches the given query: "
candidate_instruction = "Understand the content of the provided document image."

# Generate embeddings for document retrieval
query_embeds = rzen.get_fused_embeddings(instruction=query_instruction, texts=queries)
candidate_embeds = rzen.get_fused_embeddings(instruction=candidate_instruction, images=candidates)

# Calculate text-to-document similarity scores
similarity_scores = query_embeds @ candidate_embeds.T
print(similarity_scores)
```
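With a document collection larger than a couple of pages, you typically want the top-k pages per query rather than the raw score matrix. A minimal sketch using `torch.topk`, under the same (unverified) assumption that the embeddings are returned as torch tensors:

```python
import torch

# Hypothetical top-k lookup over the candidate document images above.
# Assumes `query_embeds` and `candidate_embeds` are 2-D torch.Tensors.
k = min(2, candidate_embeds.shape[0])
scores = query_embeds @ candidate_embeds.T  # (num_queries, num_candidates)
top_scores, top_idx = torch.topk(scores, k=k, dim=-1)
for q_idx, (idxs, vals) in enumerate(zip(top_idx.tolist(), top_scores.tolist())):
    hits = ", ".join(f"{candidates[i]} ({v:.4f})" for i, v in zip(idxs, vals))
    print(f"query {q_idx}: {hits}")
```

If the embeddings are L2-normalized, these scores are cosine similarities and thresholds transfer reasonably well across queries; that normalization is also an assumption to check against the upstream inference code.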
### Video Retrieval

Retrieve videos based on text captions.

```python
import cv2
import numpy as np
from PIL import Image

from rzen_embed_inference import RzenEmbed


def extract_frames(video_path, num_frames):
    """Uniformly sample `num_frames` frames from a video and return them as PIL images."""
    cap = cv2.VideoCapture(video_path)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frame_indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)
    frames = []
    for idx in frame_indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ret, frame = cap.read()
        if ret:
            # OpenCV decodes frames as BGR; convert to RGB before wrapping in PIL.
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        else:
            break
    cap.release()
    return frames


rzen = RzenEmbed("qihoo360/RzenEmbed")

queries = [
    "A traditional boat glides along a river lined with blooming cherry blossoms under an overcast sky in a modern cityscape.",
    "Tiny ginger kitten meows cutely by the water."
]

# Extract frames from the candidate videos
video_path_list = [
    "assets/example5.mp4",
    "assets/example6.mp4",
]
candidates = [extract_frames(video_path, num_frames=8) for video_path in video_path_list]

query_instruction = "Find the video snippet that corresponds to the given caption: "
candidate_instruction = "Understand the content of the provided video."

# Generate embeddings for video retrieval
query_embeds = rzen.get_fused_embeddings(instruction=query_instruction, texts=queries)
candidate_embeds = rzen.get_fused_embeddings(instruction=candidate_instruction, images=candidates)

# Calculate text-to-video similarity scores
similarity_scores = query_embeds @ candidate_embeds.T
print(similarity_scores)
```

## Citation

If you find RzenEmbed useful for your research and applications, please cite it using this BibTeX:

```bibtex
@article{jian2025rzenembed,
  title={RzenEmbed: Towards Comprehensive Multimodal Retrieval},
  author={Jian, Weijian and Zhang, Yajun and Liang, Dawei and Xie, Chunyu and He, Yixiao and Leng, Dawei and Yin, Yuhui},
  journal={arXiv preprint arXiv:2510.27350},
  year={2025}
}
```