UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning

1 MiroMind AI
2 The University of Sydney
3 M.R.L. Team
4 LMMs-Lab Team
5 Imperial College London
* Equal contribution † Corresponding author

Introduction

Overview Illustration

Universal multimodal embedding models are fundamental for tasks like visual question answering and cross-modal retrieval. While models like CLIP have set a strong baseline, they often struggle to capture fine-grained semantic differences and to handle diverse hard negative samples. This paper introduces UniME-V2, a novel framework that leverages the advanced understanding capabilities of Multimodal Large Language Models (MLLMs) to significantly enhance the learning of universal multimodal embeddings.

Key Components

  1. Hard Negative Mining: An MLLM acts as a judge to score candidate pairs, enabling the identification of high-quality, diverse hard negatives.
  2. UniME-V2: The model is trained by aligning its similarity scores with the MLLM's soft semantic labels to learn fine-grained distinctions.
  3. UniME-V2-Reranker: A separate model is trained on the mined hard negatives using a combined pairwise and listwise loss for precise reranking.

UniME-V2 achieves state-of-the-art results on the MMEB benchmark, outperforming all baseline models across in-distribution, out-of-distribution, and specialized retrieval tasks. The UniME-V2-Reranker provides further performance gains, establishing a new benchmark for universal multimodal embedding.

Methodology

MLLM-as-a-Judge for Hard Negative Mining

Motivated by the limitations of in-batch negative mining, which often yields low-diversity and low-quality negatives, we introduce an MLLM-as-a-Judge pipeline. This method first constructs a potential hard negative set via global retrieval. Then, a powerful MLLM is prompted to assess the semantic alignment of each query-candidate pair, generating a soft matching score.

The pipeline proceeds in three steps:

  1. Global Retrieval: Use VLM2Vec to select top-50 candidates as potential hard negatives.
  2. Semantic Scoring: Prompt an MLLM to evaluate query-candidate alignment and generate soft matching scores based on "Yes"/"No" token probabilities (see the scoring sketch after this list).
  3. Filtering & Sampling: Exclude false negatives via thresholding and apply cyclical sampling to ensure diverse, high-quality hard negatives.
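
To make the semantic scoring step concrete, the sketch below shows one way to turn the judge MLLM's output logits into a soft matching score from the "Yes"/"No" token probabilities. This is a minimal illustration: the token ids, vocabulary size, and filtering threshold are assumptions, and the paper's exact judge prompt and model are not reproduced here.

```python
# Minimal sketch: derive a soft matching score for a query-candidate pair from the
# judge MLLM's logits at the answer position ("Yes" vs. "No").
import torch
import torch.nn.functional as F

def soft_match_score(answer_logits: torch.Tensor,
                     yes_token_id: int,
                     no_token_id: int) -> torch.Tensor:
    """answer_logits: (vocab_size,) logits at the position where the judge answers."""
    pair = torch.stack([answer_logits[yes_token_id], answer_logits[no_token_id]])
    # Soft matching score = probability mass on "Yes", renormalized over {Yes, No}.
    return F.softmax(pair, dim=0)[0]

# Toy usage with random logits; a real pipeline would take the logits produced by
# the judge MLLM for a prompt asking whether the candidate matches the query.
vocab_size = 32000                      # assumed vocabulary size
logits = torch.randn(vocab_size)
score = soft_match_score(logits, yes_token_id=9454, no_token_id=2753)  # hypothetical ids
keep_as_hard_negative = score.item() < 0.9  # assumed threshold for filtering false negatives
print(f"soft score = {score.item():.3f}, keep as hard negative: {keep_as_hard_negative}")
```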

MLLM Judgment Based Training Framework

The MLLM judgment based training framework addresses the rigid one-to-one mapping of traditional contrastive learning by using MLLM-generated soft semantic scores as supervisory signals. The framework extracts embeddings for queries and candidates, then aligns the model's similarity distribution with the MLLM's semantic score distribution using a JS-Divergence loss, while jointly optimizing a reranker through combined pairwise and listwise training. This approach enables the model to capture fine-grained semantic distinctions among candidates, significantly enhancing its discriminative capability on complex retrieval tasks.
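
As an illustration of the alignment objective, the sketch below computes a JS-Divergence loss between the model's softmax-normalized cosine similarities and the judge's softmax-normalized semantic scores. The temperatures, tensor shapes, and function names are assumptions made for this sketch, and the joint pairwise/listwise reranker losses are omitted.

```python
# Sketch: align the embedding model's query-candidate similarity distribution with
# the MLLM judge's semantic score distribution via JS divergence.
import torch
import torch.nn.functional as F

def js_divergence(p: torch.Tensor, q: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Jensen-Shannon divergence between batches of distributions, shape (B, K)."""
    m = 0.5 * (p + q)
    kl_pm = (p * ((p + eps) / (m + eps)).log()).sum(dim=-1)
    kl_qm = (q * ((q + eps) / (m + eps)).log()).sum(dim=-1)
    return 0.5 * (kl_pm + kl_qm)

def alignment_loss(query_emb, cand_emb, judge_scores, tau_model=0.05, tau_judge=0.1):
    """
    query_emb:    (B, D) query embeddings
    cand_emb:     (B, K, D) embeddings of K candidates per query (positive + hard negatives)
    judge_scores: (B, K) MLLM soft matching scores for each query-candidate pair
    """
    q = F.normalize(query_emb, dim=-1).unsqueeze(1)         # (B, 1, D)
    c = F.normalize(cand_emb, dim=-1)                       # (B, K, D)
    sims = (q * c).sum(dim=-1)                              # cosine similarities, (B, K)
    p_model = F.softmax(sims / tau_model, dim=-1)           # model similarity distribution
    p_judge = F.softmax(judge_scores / tau_judge, dim=-1)   # judge score distribution
    return js_divergence(p_model, p_judge).mean()

# Toy usage with random tensors.
B, K, D = 4, 8, 256
loss = alignment_loss(torch.randn(B, D), torch.randn(B, K, D), torch.rand(B, K))
print(f"JS alignment loss: {loss.item():.4f}")
```

Compared with a one-hot contrastive target, the soft judge distribution allows candidates that are semantically close to the positive to receive partial credit rather than being pushed away uniformly.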

Inference Pipeline

  1. UniME-V2 generates embeddings for the query and candidates, using cosine similarity to retrieve the top-10 most relevant results.
  2. UniME-V2-Reranker refines this shortlist by evaluating candidates against the query through instruction-based reasoning to produce the final ranked output (a minimal sketch of the full pipeline follows this list).
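
A minimal sketch of this two-stage pipeline is shown below, assuming hypothetical `embed` and `rerank` callables that stand in for UniME-V2 and UniME-V2-Reranker; this is not the released API.

```python
# Sketch of two-stage inference: dense retrieval with embedding cosine similarity,
# followed by reranking of the top-k shortlist.
from typing import Callable, List

import torch
import torch.nn.functional as F

def retrieve_then_rerank(query: str,
                         candidates: List[str],
                         embed: Callable[[List[str]], torch.Tensor],
                         rerank: Callable[[str, List[str]], List[int]],
                         top_k: int = 10) -> List[str]:
    """Stage 1: cosine-similarity shortlist; Stage 2: reranker orders the shortlist."""
    q_emb = F.normalize(embed([query]), dim=-1)      # (1, D)
    c_emb = F.normalize(embed(candidates), dim=-1)   # (N, D)
    sims = (q_emb @ c_emb.T).squeeze(0)              # (N,)
    shortlist_idx = sims.topk(min(top_k, len(candidates))).indices.tolist()
    shortlist = [candidates[i] for i in shortlist_idx]
    order = rerank(query, shortlist)                 # indices into shortlist, best first
    return [shortlist[i] for i in order]

# Toy usage with a random embedder and an identity reranker standing in for
# UniME-V2 and UniME-V2-Reranker.
fake_embed = lambda texts: torch.randn(len(texts), 64)
fake_rerank = lambda q, docs: list(range(len(docs)))
print(retrieve_then_rerank("a brown bear", [f"caption {i}" for i in range(100)],
                           fake_embed, fake_rerank))
```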

Main Results

Results on the MMEB (Massive Multimodal Embedding Benchmark). IND represents the in-distribution dataset, and OOD represents the out-of-distribution dataset. The reported scores are the average Precision@1 over the corresponding datasets. The best results are marked in bold.

| Models | #Parameters | Classification | VQA | Retrieval | Grounding | IND | OOD | Overall |
|---|---|---|---|---|---|---|---|---|
| # of Datasets | | 10 | 10 | 12 | 4 | 20 | 16 | 36 |
| Zero-shot on MMEB | | | | | | | | |
| CLIP(ViT-L) | 0.4B | 42.8 | 9.1 | 53.0 | 51.8 | 37.1 | 38.7 | 39.2 |
| OpenCLIP(ViT-L) | 0.4B | 41.5 | 6.9 | 44.6 | 53.5 | 32.8 | 36.0 | 36.6 |
| Magiclens(ViT-L) | 0.4B | 38.8 | 8.3 | 35.4 | 26.0 | 31.0 | 23.7 | 27.1 |
| SigLIP(So/14) | 0.9B | 40.3 | 8.4 | 31.6 | 59.5 | 32.3 | 38.0 | 35.0 |
| BLIP2(ViT-L) | 1.4B | 27.0 | 4.2 | 33.9 | 47.0 | 25.3 | 25.1 | 28.0 |
| CLIP(ViT-BigG/14) | 2.5B | 52.3 | 14.0 | 50.5 | 60.3 | 38.9 | 45.8 | 44.3 |
| EVA-CLIP | 7B | 56.0 | 10.4 | 49.2 | 58.9 | 38.1 | 45.6 | 43.7 |
| E5-V(Phi3.5-V) | 4.2B | 39.1 | 9.6 | 38.0 | 57.6 | 33.1 | 31.9 | 36.1 |
| E5-V(LLaVA-1.6) | 7B | 39.7 | 10.8 | 39.4 | 60.2 | 34.2 | 33.4 | 37.5 |
| Fine-tuning on MMEB | | | | | | | | |
| CLIP(ViT-L) | 0.4B | 55.2 | 19.7 | 53.2 | 62.2 | 47.6 | 42.8 | 47.6 |
| VLM2Vec(Qwen2-VL) | 2B | 59.0 | 49.4 | 65.4 | 73.4 | 66.0 | 52.6 | 60.1 |
| VLM2Vec(Qwen2-VL) | 7B | 62.6 | 57.8 | 69.9 | 81.7 | 72.2 | 57.8 | 65.8 |
| LLaVE(LLaVA-OneVision) | 7B | 65.7 | 65.4 | 70.9 | 91.9 | 75.0 | 64.4 | 70.3 |
| QQMM(LLaVA-OneVision) | 7B | 66.8 | 66.8 | 70.5 | 90.4 | 74.7 | 65.6 | 70.7 |
| UniME(Qwen2-VL) | 2B | 59.0 | 53.4 | 64.9 | 69.6 | 65.5 | 54.6 | 60.6 |
| UniME(Qwen2-VL) | 7B | 64.7 | 59.0 | 71.6 | 82.7 | 72.2 | 61.4 | 67.4 |
| UniME(LLaVA-OneVision) | 7B | 66.8 | 66.6 | 70.5 | 90.9 | 74.6 | 65.8 | 70.7 |
| UniME-V2 (Qwen2-VL) | 2B | 62.1(+3.1) | 56.3(+2.9) | 68.0(+3.1) | 72.7(+3.1) | 67.4(+1.9) | 58.9(+4.3) | 63.6(+3.0) |
| UniME-V2 (Qwen2-VL) | 7B | 64.0(-0.7) | 60.1(+1.1) | 73.1(+1.5) | 82.8(+0.1) | 72.0(-0.2) | 63.0(+1.6) | 68.0(+0.6) |
| UniME-V2 (LLaVA-OneVision) | 7B | 65.3(-1.5) | 67.6(+1.0) | 72.9(+2.4) | 90.2(-0.7) | 74.8(+0.2) | 66.7(+0.9) | 71.2(+0.5) |

Results of zero-shot text-image retrieval on short caption datasets (Flickr30K and MS-COCO), long caption datasets (ShareGPT4V and Urban1K) and compositional benchmark (SugarCrepe). The reported scores are the average Recall@1 over the corresponding datasets. The best results are marked in bold.

| Models | #Parameters | Flickr30K qi→ct | Flickr30K qt→ci | COCO qi→ct | COCO qt→ci | ShareGPT4V qi→ct | ShareGPT4V qt→ci | Urban1K qi→ct | Urban1K qt→ci | SugarCrepe Replace | SugarCrepe Swap | SugarCrepe Add |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OpenCLIP(ViT-L) | 0.4B | 67.3 | 87.2 | 37.0 | 58.1 | 81.8 | 84.0 | 47.0 | 47.0 | 79.5 | 62.7 | 74.9 |
| CLIP(ViT-BigG/14) | 2.5B | 79.5 | 92.9 | 51.3 | 67.3 | 90.1 | 93.6 | 77.8 | 80.7 | 86.5 | 68.9 | 88.4 |
| EVA-CLIP | 8B | 80.3 | 94.5 | 52.0 | 70.1 | 93.1 | 91.2 | 80.4 | 77.8 | 85.9 | 70.3 | 86.7 |
| E5-V(Phi3.5-V) | 4.2B | 72.2 | 79.6 | 44.7 | 53.4 | 86.0 | 88.5 | 83.8 | 83.6 | 88.2 | 66.6 | 75.3 |
| E5-V (LLaVA-1.6) | 7B | 77.3 | 85.7 | 49.1 | 57.6 | 85.1 | 82.1 | 88.9 | 83.2 | 86.3 | 68.7 | 66.9 |
| VLM2Vec (Qwen2-VL) | 2B | 69.3 | 89.6 | 40.0 | 62.5 | 78.1 | 88.2 | 78.7 | 83.9 | 67.2 | 46.5 | 66.4 |
| VLM2Vec (Qwen2-VL) | 7B | 80.0 | 94.2 | 49.2 | 68.5 | 78.5 | 90.4 | 94.0 | 94.2 | 70.0 | 51.7 | 72.2 |
| UniME (Qwen2-VL) | 2B | 74.9 | 90.6 | 44.0 | 63.5 | 83.6 | 88.6 | 83.3 | 83.2 | 65.6 | 45.2 | 65.7 |
| UniME (Qwen2-VL) | 7B | 80.8 | 92.7 | 50.9 | 69.8 | 86.5 | 93.8 | 95.3 | 94.0 | 68.8 | 53.0 | 69.8 |
| UniME (LLaVA-OneVision) | 7B | 83.3 | 94.4 | 54.8 | 74.0 | 93.9 | 89.3 | 94.3 | 95.5 | 80.5 | 65.5 | 82.2 |
| UniME-V2 (Qwen2-VL) | 2B | 79.8(+4.9) | 89.9(-0.7) | 53.7(+9.7) | 65.1(+1.6) | 91.6(+8.0) | 94.2(+5.6) | 95.6(+12.3) | 92.2(+9.0) | 70.9(+5.3) | 51.2(+6.0) | 70.2(+4.5) |
| UniME-V2 (Qwen2-VL) | 7B | 84.6(+3.8) | 93.5(+0.8) | 57.3(+6.4) | 70.3(+0.5) | 94.3(+0.8) | 95.2(+1.4) | 97.2(+1.9) | 96.3(+2.3) | 77.8(+9.0) | 62.2(+9.2) | 79.0(+9.2) |
| UniME-V2 (LLaVA-OneVision) | 7B | 85.5(+2.2) | 93.7(-0.7) | 60.9(+6.1) | 74.1(+0.1) | 95.1(+1.2) | 94.1(+4.8) | 96.3(+2.0) | 96.7(+1.2) | 88.6(+8.1) | 73.7(+8.2) | 90.5(+8.3) |

As shown in the table below, UniME-V2-Reranker consistently outperforms LamRA in listwise reranking across all four tasks, using the same base model and training setup. With roughly half the training data (0.6M vs. 1.1M pairs), it achieves superior results, excelling in particular on compositional retrieval, where it gains up to 7.4 points.

| Embedding Model | Reranker | #Data | MMEB | Short Caption Retrieval | Long Caption Retrieval | Compositional Retrieval |
|---|---|---|---|---|---|---|
| UniME(Qwen2-VL-2B) | --- | --- | 60.6 | 68.3 | 84.7 | 58.8 |
| UniME-V2(Qwen2-VL-2B) | --- | --- | 63.6 | 72.1 | 93.4 | 64.1 |
| UniME-V2(Qwen2-VL-2B) | LamRA(Qwen2.5-VL-7B) | 1.1M | 67.3 | 76.4 | 96.4 | 87.4 |
| UniME-V2(Qwen2-VL-2B) | UniME-V2-Reranker(Qwen2.5-VL-7B) | 0.6M | 67.6 | 76.4 | 96.9 | 94.8 |
| UniME(Qwen2-VL-7B) | --- | --- | 67.4 | 73.6 | 92.4 | 63.9 |
| UniME-V2(Qwen2-VL-7B) | --- | --- | 68.0 | 76.4 | 95.8 | 73.0 |
| UniME-V2(Qwen2-VL-7B) | LamRA(Qwen2.5-VL-7B) | 1.1M | 69.1 | 78.3 | 97.2 | 87.4 |
| UniME-V2(Qwen2-VL-7B) | UniME-V2-Reranker(Qwen2.5-VL-7B) | 0.6M | 69.6 | 78.7 | 97.5 | 94.8 |

Qualitative Analysis

t-SNE visualization
Figure 1: Comparison of representation distributions between EVA-CLIP-8B (left) and UniME-V2 (LLaVA-OneVision-7B) (right).
Semantic visualization
Figure 2: We present the retrieval and reranking results of our method across different tasks.

Figure 1: t-SNE Analysis

UniME-V2 achieves superior zero-shot cross-modal retrieval performance, with significant improvements on long-caption tasks and more robust results than EVA-CLIP-8B. As shown in Figure 1, this enhanced performance is primarily because UniME-V2's universal multimodal embedding effectively reduces the modality gap between text and images, creating a more aligned and unified representation space.

Figure 2: Qualitative Examples

The qualitative examples demonstrate the effectiveness of our two-stage retrieval pipeline. The visualization shows that UniME-V2 successfully retrieves semantically relevant candidates (e.g., both "black bear" and "brown bear" for a bear query), while UniME-V2-Reranker further refines the ranking to select the most accurate candidate (e.g., prioritizing "brown bear"). This superior performance is achieved because the reranker leverages the MLLM's advanced reasoning capability to make finer-grained semantic distinctions that are challenging for embedding-based retrieval alone.

BibTeX

If you find our work helpful for your research, please consider giving a citation 📃


      @misc{gu2025unimev2mllmasajudgeuniversalmultimodal,
            title={UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning}, 
            author={Tiancheng Gu and Kaicheng Yang and Kaichen Zhang and Xiang An and Ziyong Feng and Yueyi Zhang and Weidong Cai and Jiankang Deng and Lidong Bing},
            year={2025},
            eprint={2510.13515},
            archivePrefix={arXiv},
            primaryClass={cs.CV},
            url={https://arxiv.org/abs/2510.13515}, 
      }

      @inproceedings{unime,
            title={Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs},
            author={Gu, Tiancheng and Yang, Kaicheng and Feng, Ziyong and Wang, Xingjun and Zhang, Yanzhao and Long, Dingkun and Chen, Yingda and Cai, Weidong and Deng, Jiankang},
            booktitle={ACM MM},
            year={2025}
      }