UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning

1 MiroMind AI
2 The University of Sydney
3 M.R.L. Team
4 LMMs-Lab Team
5 Imperial College London
* Equal contribution † Corresponding author

Introduction

Overview Illustration

Universal multimodal embedding models are fundamental for tasks like visual question answering and cross-modal retrieval. While models like CLIP have set a strong baseline, they often struggle to capture fine-grained semantic differences and to handle diverse hard negative samples. This paper introduces UniME-V2, a novel framework that leverages the advanced understanding capabilities of Multimodal Large Language Models (MLLMs) to significantly enhance the learning of universal multimodal embeddings.

Key Components

  1. Hard Negative Mining: An MLLM acts as a judge to score candidate pairs, enabling the identification of high-quality, diverse hard negatives.
  2. UniME-V2: The model is trained by aligning its similarity scores with the MLLM's soft semantic labels to learn fine-grained distinctions.
  3. UniME-V2-Reranker: A separate model is trained on the mined hard negatives using a combined pairwise and listwise loss for precise reranking.

UniME-V2 achieves state-of-the-art results on the MMEB benchmark, outperforming all baseline models across in-distribution, out-of-distribution, and specialized retrieval tasks. The UniME-V2-Reranker provides further performance gains, establishing a new benchmark for universal multimodal embedding.

Methodology

MLLM-as-a-Judge for Hard Negative Mining

Motivated by the limitations of in-batch negative mining, which often yields low-diversity and low-quality negatives, we introduce an MLLM-as-a-Judge pipeline. This method first constructs a potential hard negative set via global retrieval. Then, a powerful MLLM is prompted to assess the semantic alignment of each query-candidate pair, generating a soft matching score.

The pipeline proceeds in three steps:

  1. Global Retrieval: Use VLM2Vec to select top-50 candidates as potential hard negatives.
  2. Semantic Scoring: Prompt an MLLM to evaluate query-candidate alignment and generate soft matching scores based on "Yes"/"No" token probabilities (see the scoring sketch after this list).
  3. Filtering & Sampling: Exclude false negatives via thresholding and apply cyclical sampling to ensure diverse, high-quality hard negatives.
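
To make the semantic scoring step concrete, the sketch below shows one way to turn the judge MLLM's output logits into a soft matching score from the "Yes"/"No" token probabilities. This is a minimal illustration: the token ids, vocabulary size, and filtering threshold are assumptions, and the paper's exact judge prompt and model are not reproduced here.

```python
# Minimal sketch: derive a soft matching score for a query-candidate pair from the
# judge MLLM's logits at the answer position ("Yes" vs. "No").
import torch
import torch.nn.functional as F

def soft_match_score(answer_logits: torch.Tensor,
                     yes_token_id: int,
                     no_token_id: int) -> torch.Tensor:
    """answer_logits: (vocab_size,) logits at the position where the judge answers."""
    pair = torch.stack([answer_logits[yes_token_id], answer_logits[no_token_id]])
    # Soft matching score = probability mass on "Yes", renormalized over {Yes, No}.
    return F.softmax(pair, dim=0)[0]

# Toy usage with random logits; a real pipeline would take the logits produced by
# the judge MLLM for a prompt asking whether the candidate matches the query.
vocab_size = 32000                      # assumed vocabulary size
logits = torch.randn(vocab_size)
score = soft_match_score(logits, yes_token_id=9454, no_token_id=2753)  # hypothetical ids
keep_as_hard_negative = score.item() < 0.9  # assumed threshold for filtering false negatives
print(f"soft score = {score.item():.3f}, keep as hard negative: {keep_as_hard_negative}")
```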

MLLM Judgment Based Training Framework

The MLLM judgment based training framework addresses the rigid one-to-one mapping of traditional contrastive learning by using MLLM-generated soft semantic scores as supervisory signals. The framework extracts embeddings for queries and candidates, then aligns the model's similarity distribution with the MLLM's semantic score distribution using a JS-Divergence loss, while jointly optimizing a reranker through combined pairwise and listwise training. This approach enables the model to capture fine-grained semantic distinctions among candidates, significantly enhancing its discriminative capability on complex retrieval tasks.
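
As an illustration of the alignment objective, the sketch below computes a JS-Divergence loss between the model's softmax-normalized cosine similarities and the judge's softmax-normalized semantic scores. The temperatures, tensor shapes, and function names are assumptions made for this sketch, and the joint pairwise/listwise reranker losses are omitted.

```python
# Sketch: align the embedding model's query-candidate similarity distribution with
# the MLLM judge's semantic score distribution via JS divergence.
import torch
import torch.nn.functional as F

def js_divergence(p: torch.Tensor, q: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Jensen-Shannon divergence between batches of distributions, shape (B, K)."""
    m = 0.5 * (p + q)
    kl_pm = (p * ((p + eps) / (m + eps)).log()).sum(dim=-1)
    kl_qm = (q * ((q + eps) / (m + eps)).log()).sum(dim=-1)
    return 0.5 * (kl_pm + kl_qm)

def alignment_loss(query_emb, cand_emb, judge_scores, tau_model=0.05, tau_judge=0.1):
    """
    query_emb:    (B, D) query embeddings
    cand_emb:     (B, K, D) embeddings of K candidates per query (positive + hard negatives)
    judge_scores: (B, K) MLLM soft matching scores for each query-candidate pair
    """
    q = F.normalize(query_emb, dim=-1).unsqueeze(1)         # (B, 1, D)
    c = F.normalize(cand_emb, dim=-1)                       # (B, K, D)
    sims = (q * c).sum(dim=-1)                              # cosine similarities, (B, K)
    p_model = F.softmax(sims / tau_model, dim=-1)           # model similarity distribution
    p_judge = F.softmax(judge_scores / tau_judge, dim=-1)   # judge score distribution
    return js_divergence(p_model, p_judge).mean()

# Toy usage with random tensors.
B, K, D = 4, 8, 256
loss = alignment_loss(torch.randn(B, D), torch.randn(B, K, D), torch.rand(B, K))
print(f"JS alignment loss: {loss.item():.4f}")
```

Compared with a one-hot contrastive target, the soft judge distribution allows candidates that are semantically close to the positive to receive partial credit rather than being pushed away uniformly.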

Inference Pipeline

  1. UniME-V2 generates embeddings for the query and candidates, using cosine similarity to retrieve the top-10 most relevant results.
  2. UniME-V2-Reranker refines this shortlist by evaluating candidates against the query through instruction-based reasoning to produce the final ranked output (a minimal sketch of the full pipeline follows this list).
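
A minimal sketch of this two-stage pipeline is shown below, assuming hypothetical `embed` and `rerank` callables that stand in for UniME-V2 and UniME-V2-Reranker; this is not the released API.

```python
# Sketch of two-stage inference: dense retrieval with embedding cosine similarity,
# followed by reranking of the top-k shortlist.
from typing import Callable, List

import torch
import torch.nn.functional as F

def retrieve_then_rerank(query: str,
                         candidates: List[str],
                         embed: Callable[[List[str]], torch.Tensor],
                         rerank: Callable[[str, List[str]], List[int]],
                         top_k: int = 10) -> List[str]:
    """Stage 1: cosine-similarity shortlist; Stage 2: reranker orders the shortlist."""
    q_emb = F.normalize(embed([query]), dim=-1)      # (1, D)
    c_emb = F.normalize(embed(candidates), dim=-1)   # (N, D)
    sims = (q_emb @ c_emb.T).squeeze(0)              # (N,)
    shortlist_idx = sims.topk(min(top_k, len(candidates))).indices.tolist()
    shortlist = [candidates[i] for i in shortlist_idx]
    order = rerank(query, shortlist)                 # indices into shortlist, best first
    return [shortlist[i] for i in order]

# Toy usage with a random embedder and an identity reranker standing in for
# UniME-V2 and UniME-V2-Reranker.
fake_embed = lambda texts: torch.randn(len(texts), 64)
fake_rerank = lambda q, docs: list(range(len(docs)))
print(retrieve_then_rerank("a brown bear", [f"caption {i}" for i in range(100)],
                           fake_embed, fake_rerank))
```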

Main Results

Results on the MMEB (Massive Multimodal Embedding Benchmark). IND represents the in-distribution dataset, and OOD represents the out-of-distribution dataset. The reported scores are the average Precision@1 over the corresponding datasets. The best results are marked in bold.

| Models | #Parameters | Classification | VQA | Retrieval | Grounding | IND | OOD | Overall |
|---|---|---|---|---|---|---|---|---|
| # of Datasets | | 10 | 10 | 12 | 4 | 20 | 16 | 36 |
| Zero-shot on MMEB | | | | | | | | |
| CLIP(ViT-L) | 0.4B | 42.8 | 9.1 | 53.0 | 51.8 | 37.1 | 38.7 | 39.2 |
| OpenCLIP(ViT-L) | 0.4B | 41.5 | 6.9 | 44.6 | 53.5 | 32.8 | 36.0 | 36.6 |
| Magiclens(ViT-L) | 0.4B | 38.8 | 8.3 | 35.4 | 26.0 | 31.0 | 23.7 | 27.1 |
| SigLIP(So/14) | 0.9B | 40.3 | 8.4 | 31.6 | 59.5 | 32.3 | 38.0 | 35.0 |
| BLIP2(ViT-L) | 1.4B | 27.0 | 4.2 | 33.9 | 47.0 | 25.3 | 25.1 | 28.0 |
| CLIP(ViT-BigG/14) | 2.5B | 52.3 | 14.0 | 50.5 | 60.3 | 38.9 | 45.8 | 44.3 |
| EVA-CLIP | 7B | 56.0 | 10.4 | 49.2 | 58.9 | 38.1 | 45.6 | 43.7 |
| E5-V(Phi3.5-V) | 4.2B | 39.1 | 9.6 | 38.0 | 57.6 | 33.1 | 31.9 | 36.1 |
| E5-V(LLaVA-1.6) | 7B | 39.7 | 10.8 | 39.4 | 60.2 | 34.2 | 33.4 | 37.5 |
| Fine-tuning on MMEB | | | | | | | | |
| CLIP(ViT-L) | 0.4B | 55.2 | 19.7 | 53.2 | 62.2 | 47.6 | 42.8 | 47.6 |
| VLM2Vec(Qwen2-VL) | 2B | 59.0 | 49.4 | 65.4 | 73.4 | 66.0 | 52.6 | 60.1 |
| VLM2Vec(Qwen2-VL) | 7B | 62.6 | 57.8 | 69.9 | 81.7 | 72.2 | 57.8 | 65.8 |
| LLaVE(LLaVA-OneVision) | 7B | 65.7 | 65.4 | 70.9 | 91.9 | 75.0 | 64.4 | 70.3 |
| QQMM(LLaVA-OneVision) | 7B | 66.8 | 66.8 | 70.5 | 90.4 | 74.7 | 65.6 | 70.7 |
| UniME(Qwen2-VL) | 2B | 59.0 | 53.4 | 64.9 | 69.6 | 65.5 | 54.6 | 60.6 |
| UniME(Qwen2-VL) | 7B | 64.7 | 59.0 | 71.6 | 82.7 | 72.2 | 61.4 | 67.4 |
| UniME(LLaVA-OneVision) | 7B | 66.8 | 66.6 | 70.5 | 90.9 | 74.6 | 65.8 | 70.7 |
| UniME-V2 (Qwen2-VL) | 2B | 62.1(+3.1) | 56.3(+2.9) | 68.0(+3.1) | 72.7(+3.1) | 67.4(+1.9) | 58.9(+4.3) | 63.6(+3.0) |
| UniME-V2 (Qwen2-VL) | 7B | 64.0(-0.7) | 60.1(+1.1) | 73.1(+1.5) | 82.8(+0.1) | 72.0(-0.2) | 63.0(+1.6) | 68.0(+0.6) |
| UniME-V2 (LLaVA-OneVision) | 7B | 65.3(-1.5) | 67.6(+1.0) | 72.9(+2.4) | 90.2(-0.7) | 74.8(+0.2) | 66.7(+0.9) | 71.2(+0.5) |

Results of zero-shot text-image retrieval on short caption datasets (Flickr30K and MS-COCO), long caption datasets (ShareGPT4V and Urban1K) and compositional benchmark (SugarCrepe). The reported scores are the average Recall@1 over the corresponding datasets. The best results are marked in bold.

| Models | #Parameters | Flickr30K qi→ct | Flickr30K qt→ci | COCO qi→ct | COCO qt→ci | ShareGPT4V qi→ct | ShareGPT4V qt→ci | Urban1K qi→ct | Urban1K qt→ci | SugarCrepe Replace | SugarCrepe Swap | SugarCrepe Add |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OpenCLIP(ViT-L) | 0.4B | 67.3 | 87.2 | 37.0 | 58.1 | 81.8 | 84.0 | 47.0 | 47.0 | 79.5 | 62.7 | 74.9 |
| CLIP(ViT-BigG/14) | 2.5B | 79.5 | 92.9 | 51.3 | 67.3 | 90.1 | 93.6 | 77.8 | 80.7 | 86.5 | 68.9 | 88.4 |
| EVA-CLIP | 8B | 80.3 | 94.5 | 52.0 | 70.1 | 93.1 | 91.2 | 80.4 | 77.8 | 85.9 | 70.3 | 86.7 |
| E5-V(Phi3.5-V) | 4.2B | 72.2 | 79.6 | 44.7 | 53.4 | 86.0 | 88.5 | 83.8 | 83.6 | 88.2 | 66.6 | 75.3 |
| E5-V (LLaVA-1.6) | 7B | 77.3 | 85.7 | 49.1 | 57.6 | 85.1 | 82.1 | 88.9 | 83.2 | 86.3 | 68.7 | 66.9 |
| VLM2Vec (Qwen2-VL) | 2B | 69.3 | 89.6 | 40.0 | 62.5 | 78.1 | 88.2 | 78.7 | 83.9 | 67.2 | 46.5 | 66.4 |
| VLM2Vec (Qwen2-VL) | 7B | 80.0 | 94.2 | 49.2 | 68.5 | 78.5 | 90.4 | 94.0 | 94.2 | 70.0 | 51.7 | 72.2 |
| UniME (Qwen2-VL) | 2B | 74.9 | 90.6 | 44.0 | 63.5 | 83.6 | 88.6 | 83.3 | 83.2 | 65.6 | 45.2 | 65.7 |
| UniME (Qwen2-VL) | 7B | 80.8 | 92.7 | 50.9 | 69.8 | 86.5 | 93.8 | 95.3 | 94.0 | 68.8 | 53.0 | 69.8 |
| UniME (LLaVA-OneVision) | 7B | 83.3 | 94.4 | 54.8 | 74.0 | 93.9 | 89.3 | 94.3 | 95.5 | 80.5 | 65.5 | 82.2 |
| UniME-V2 (Qwen2-VL) | 2B | 79.8(+4.9) | 89.9(-0.7) | 53.7(+9.7) | 65.1(+1.6) | 91.6(+8.0) | 94.2(+5.6) | 95.6(+12.3) | 92.2(+9.0) | 70.9(+5.3) | 51.2(+6.0) | 70.2(+4.5) |
| UniME-V2 (Qwen2-VL) | 7B | 84.6(+3.8) | 93.5(+0.8) | 57.3(+6.4) | 70.3(+0.5) | 94.3(+0.8) | 95.2(+1.4) | 97.2(+1.9) | 96.3(+2.3) | 77.8(+9.0) | 62.2(+9.2) | 79.0(+9.2) |
| UniME-V2 (LLaVA-OneVision) | 7B | 85.5(+2.2) | 93.7(-0.7) | 60.9(+6.1) | 74.1(+0.1) | 95.1(+1.2) | 94.1(+4.8) | 96.3(+2.0) | 96.7(+1.2) | 88.6(+8.1) | 73.7(+8.2) | 90.5(+8.3) |

As shown in the table below, UniME-V2-Reranker consistently outperforms LamRA in listwise reranking across all four tasks, using the same base model and training setup. With roughly half the training data (0.6M vs. 1.1M pairs), it achieves superior results, excelling in particular on compositional retrieval, where it gains up to 7.4 points.

| Embedding Model | Reranker | #Data | MMEB | Short Caption Retrieval | Long Caption Retrieval | Compositional Retrieval |
|---|---|---|---|---|---|---|
| UniME(Qwen2-VL-2B) | --- | --- | 60.6 | 68.3 | 84.7 | 58.8 |
| UniME-V2(Qwen2-VL-2B) | --- | --- | 63.6 | 72.1 | 93.4 | 64.1 |
| UniME-V2(Qwen2-VL-2B) | LamRA(Qwen2.5-VL-7B) | 1.1M | 67.3 | 76.4 | 96.4 | 87.4 |
| UniME-V2(Qwen2-VL-2B) | UniME-V2-Reranker(Qwen2.5-VL-7B) | 0.6M | 67.6 | 76.4 | 96.9 | 94.8 |
| UniME(Qwen2-VL-7B) | --- | --- | 67.4 | 73.6 | 92.4 | 63.9 |
| UniME-V2(Qwen2-VL-7B) | --- | --- | 68.0 | 76.4 | 95.8 | 73.0 |
| UniME-V2(Qwen2-VL-7B) | LamRA(Qwen2.5-VL-7B) | 1.1M | 69.1 | 78.3 | 97.2 | 87.4 |
| UniME-V2(Qwen2-VL-7B) | UniME-V2-Reranker(Qwen2.5-VL-7B) | 0.6M | 69.6 | 78.7 | 97.5 | 94.8 |

Qualitative Analysis

t-SNE visualization
Figure 1: Comparison of representation distributions between EVA-CLIP-8B (left) and UniME-V2 (LLaVA-OneVision-7B) (right).
Semantic visualization
Figure 2: We present the retrieval and reranking results of our method across different tasks.

Figure 1: t-SNE Analysis

UniME-V2 achieves superior zero-shot cross-modal retrieval performance, with significant improvements on long-caption tasks and more robust results than EVA-CLIP-8B. As shown in Figure 1, this enhanced performance is primarily because UniME-V2's universal multimodal embedding effectively reduces the modality gap between text and images, creating a more aligned and unified representation space.

Figure 2: Qualitative Examples

The qualitative examples demonstrate the effectiveness of our two-stage retrieval pipeline. The visualization shows that UniME-V2 successfully retrieves semantically relevant candidates (e.g., both "black bear" and "brown bear" for a bear query), while UniME-V2-Reranker further refines the ranking to select the most accurate candidate (e.g., prioritizing "brown bear"). This superior performance is achieved because the reranker leverages the MLLM's advanced reasoning capability to make finer-grained semantic distinctions that are challenging for embedding-based retrieval alone.

BibTeX

If you find our work helpful for your research, please consider giving a citation 📃


      @misc{gu2025unimev2mllmasajudgeuniversalmultimodal,
            title={UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning}, 
            author={Tiancheng Gu and Kaicheng Yang and Kaichen Zhang and Xiang An and Ziyong Feng and Yueyi Zhang and Weidong Cai and Jiankang Deng and Lidong Bing},
            year={2025},
            eprint={2510.13515},
            archivePrefix={arXiv},
            primaryClass={cs.CV},
            url={https://arxiv.org/abs/2510.13515}, 
      }

      @inproceedings{unime,
            title={Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs},
            author={Gu, Tiancheng and Yang, Kaicheng and Feng, Ziyong and Wang, Xingjun and Zhang, Yanzhao and Long, Dingkun and Chen, Yingda and Cai, Weidong and Deng, Jiankang},
            booktitle={ACM MM},
            year={2025}
      }