Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs

1The University of Sydney
2DeepGlint
3Tongyi Lab, Alibaba Group
4Imperial College London
* Equal contribution † Corresponding author

Introduction

Overview Illustration

The CLIP framework excels at multimodal representation learning but suffers from three limitations: (1) text token truncation, which discards input beyond the text encoder's length limit; (2) isolated image-text encoding, since images and text are embedded by separate encoders with no cross-modal interaction; (3) weak compositionality. While Multimodal Large Language Models (MLLMs) show strong vision-language understanding, their potential for transferable representations remains underutilized. We propose UniME (Universal Multimodal Embedding), a two-stage framework that enhances MLLMs for discriminative representation learning:

Key Components

  1. Textual discriminative knowledge distillation from an LLM-based teacher to improve the MLLM's language embeddings
  2. Hard negative-enhanced instruction tuning, which mitigates false negatives and samples challenging negatives per batch

Experiments on MMEB (Massive Multimodal Embedding Benchmark) and diverse retrieval tasks (short/long caption + compositional) show that UniME achieves consistent improvements, demonstrating superior discriminative and compositional capabilities.

Methodology

Textual Discriminative Knowledge Distillation

To enhance the MLLM's embedding capability, we propose textual knowledge distillation from NV-Embed V2 (a strong LLM-based embedding model). The training process involves decoupling the MLLM's LLM component and processing text with the prompt "Summarize the above sentences in one word.", followed by aligning the student (MLLM) and teacher (NV-Embed V2) embeddings via KL divergence on batch-wise similarity distributions. Notably, only the LLM component is fine-tuned during this process, while all other parameters remain frozen. During inference, the vision encoder is reintegrated for multimodal processing: unimodal inputs (text or image) use modality-specific prompts, whereas interleaved inputs generate the final representation by summing text and image embeddings.
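A minimal sketch of the batch-wise similarity distillation objective, assuming PyTorch; the temperature value and the use of cosine similarity with a row-wise softmax are illustrative choices, not details taken from the paper:

    import torch
    import torch.nn.functional as F

    def text_distillation_loss(student_emb, teacher_emb, tau=0.02):
        """KL divergence between batch-wise similarity distributions.

        student_emb: (B, D_s) text embeddings from the MLLM's decoupled LLM component.
        teacher_emb: (B, D_t) embeddings of the same texts from NV-Embed V2 (frozen).
        The embedding dimensions need not match: the loss compares B x B
        similarity matrices, not the embeddings themselves.
        """
        s = F.normalize(student_emb, dim=-1)
        t = F.normalize(teacher_emb, dim=-1)

        # Row i holds the similarity of sample i to every sample in the batch.
        sim_student = s @ s.T / tau
        sim_teacher = t @ t.T / tau

        # Push the student's similarity distribution toward the teacher's.
        log_p_student = F.log_softmax(sim_student, dim=-1)
        p_teacher = F.softmax(sim_teacher, dim=-1)
        return F.kl_div(log_p_student, p_teacher, reduction="batchmean")

Only the student side receives gradients; the teacher embeddings are treated as constants, consistent with keeping NV-Embed V2 frozen.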

Hard Negative Enhanced Instruction Tuning

This stage further improves the MLLM's visual sensitivity, cross-modal alignment, and instruction-following ability. It rests on two mechanisms: a false negative filter that discards misleading in-batch candidates whose similarity to the query exceeds a threshold α = cos(e_q, e_c⁺) + β, where e_q is the query embedding, e_c⁺ is the positive candidate embedding, and β is a small margin; and an automatic hard negative sampler that keeps the top-k most similar non-matching candidates to raise training difficulty. Each batch therefore passes through three steps: candidates with similarity above α are filtered out as false negatives, the k hardest remaining negatives per query are selected, and a contrastive loss is applied only to these carefully curated difficult cases.
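A minimal sketch of this in-batch filtering and sampling procedure, assuming PyTorch; the margin β, negative count k, and temperature τ are placeholder values rather than the paper's settings:

    import torch
    import torch.nn.functional as F

    def hard_negative_contrastive_loss(q_emb, c_emb, beta=0.1, k=8, tau=0.05):
        """Contrastive loss over a filtered set of hard in-batch negatives.

        q_emb: (B, D) query embeddings; c_emb: (B, D) candidate embeddings,
        where c_emb[i] is the positive candidate for q_emb[i]. Assumes B > k.
        """
        q = F.normalize(q_emb, dim=-1)
        c = F.normalize(c_emb, dim=-1)
        sim = q @ c.T                                   # (B, B) cosine similarities
        pos = sim.diagonal()                            # cos(e_q, e_c+) per query

        # 1) False negative filtering: any candidate whose similarity exceeds
        #    alpha = cos(e_q, e_c+) + beta is treated as a false negative.
        alpha = pos.unsqueeze(1) + beta                 # per-query threshold, (B, 1)
        is_false_neg = sim > alpha
        is_positive = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
        neg_sim = sim.masked_fill(is_false_neg | is_positive, float("-inf"))

        # 2) Hard negative sampling: keep the k most similar remaining negatives.
        hard_neg, _ = neg_sim.topk(k, dim=1)            # (B, k)

        # 3) Contrastive loss over the positive and its k hard negatives only.
        logits = torch.cat([pos.unsqueeze(1), hard_neg], dim=1) / tau
        labels = torch.zeros(sim.size(0), dtype=torch.long, device=sim.device)
        return F.cross_entropy(logits, labels)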

Key Benefit:

Delivers enhanced model discrimination capabilities while preserving training efficiency through intelligent sample selection.

Main Results

Results on the MMEB (Massive Multimodal Embedding Benchmark). IND denotes the in-distribution datasets and OOD the out-of-distribution datasets. The reported scores are the average Precision@1 over the corresponding datasets, with the best results marked in bold. UniME†: UniME with textual discriminative knowledge distillation only. UniME‡: UniME with both textual discriminative knowledge distillation and hard negative enhanced instruction tuning.

Model  #Params  |  Per Meta-Task Score: Classification  VQA  Retrieval  Grounding  |  Average Score: IND  OOD  Overall
# of datasets →  |  10  10  12  4  |  20  16  36
Zero-shot on MMEB
CLIP(ViT-L) 0.4B 42.8 9.1 53.0 51.8 37.1 38.7 39.2
OpenCLIP(ViT-L) 0.4B 41.5 6.9 44.6 53.5 32.8 36.0 36.6
Magiclens(ViT-L) 0.4B 38.8 8.3 35.4 26.0 31.0 23.7 27.1
SigLIP(So/14) 0.9B 40.3 8.4 31.6 59.5 32.3 38.0 35.0
BLIP2(ViT-L) 1.4B 27.0 4.2 33.9 47.0 25.3 25.1 28.0
CLIP(ViT-BigG/14) 2.5B 52.3 14.0 50.5 60.3 38.9 45.8 44.3
EVA-CLIP 8B 56.0 10.4 49.2 58.9 38.1 45.6 43.7
E5-V(Phi3.5-V) 4.2B 39.1 9.6 38.0 57.6 33.1 31.9 36.1
E5-V(LLaVA-1.6) 7B 39.7 10.8 39.4 60.2 34.2 33.4 37.5
UniME†(Phi3.5-V) 4.2B 42.5(+3.4) 18.3(+8.7) 40.5(+2.5) 59.9(+2.3) 36.0(+2.9) 38.3(+6.4) 40.3(+4.2)
UniME†(LLaVA-1.6) 7B 43.0(+3.3) 17.7(+6.9) 42.5(+3.1) 63.2(+3.0) 37.6(+3.4) 38.6(+5.2) 41.6(+4.1)
Fine-tuning on MMEB
CLIP(ViT-L) 0.4B 55.2 19.7 53.2 62.2 47.6 42.8 47.6
VLM2Vec(Phi3.5-V) 4.2B 54.8 54.9 62.3 79.5 66.5 52.0 62.9
VLM2Vec(LLaVA-1.6) 7B 56.8 50.4 63.3 82.6 64.9 53.6 63.3
UniME‡(Phi3.5-V) 4.2B 54.8(+0.0) 55.9(+1.0) 64.5(+2.2) 81.8(+2.3) 68.2(+1.7) 52.7(+0.7) 64.2(+1.3)
UniME‡(LLaVA-1.6) 7B 60.6(+3.8) 52.9(+2.5) 67.9(+4.6) 85.1(+2.5) 68.4(+3.5) 57.9(+4.0) 66.6(+3.3)

Results of zero-shot text-image retrieval on short-caption datasets (Flickr30K and MS-COCO), long-caption datasets (ShareGPT4V and Urban1K), and the compositional benchmark SugarCrepe. The reported scores are the average Recall@1 over the corresponding datasets, with the best results marked in bold. UniME†: UniME with textual discriminative knowledge distillation only. UniME‡: UniME with both textual discriminative knowledge distillation and hard negative enhanced instruction tuning.

Model  #Params  |  Flickr30K: qi→ct  qt→ci  |  COCO: qi→ct  qt→ci  |  ShareGPT4V: qi→ct  qt→ci  |  Urban1K: qi→ct  qt→ci  |  SugarCrepe: Replace  Swap  Add
OpenCLIP(ViT-L) 0.4B 67.3 87.2 37.0 58.1 81.8 84.0 47.0 47.0 79.5 62.7 74.9
CLIP(ViT-BigG/14) 2.5B 79.5 92.9 51.3 67.3 90.1 93.6 77.8 80.7 86.5 68.9 88.4
EVA-CLIP 8B 80.3 94.5 52.0 70.1 93.1 91.2 80.4 77.8 85.9 70.3 86.7
E5-V(Phi3.5-V) 4.2B 72.2 79.6 44.7 53.4 86.0 88.5 83.8 83.6 88.2 66.6 75.3
E5-V(LLaVA-1.6) 7B 77.3 85.7 49.1 57.6 85.1 82.1 88.9 83.2 86.3 68.7 66.9
UniME†(Phi3.5-V) 4.2B 72.0(-0.2) 80.6(+1.0) 44.9(+0.2) 57.2(+0.8) 86.8(+3.8) 92.3(+1.3) 85.1(+2.3) 86.9(+3.3) 90.2(+2.0) 67.6(+1.0) 91.2(+15.9)
UniME†(LLaVA-1.6) 7B 77.2(-0.1) 84.6(-1.1) 51.0(+1.9) 56.4(-1.2) 89.8(+4.7) 86.9(+4.8) 91.3(+2.4) 82.4(-0.8) 89.5(+3.2) 64.8(-3.9) 94.2(+27.3)
VLM2Vec(Phi3.5-V) 4.2B 68.7 83.0 43.7 59.8 90.1 92.0 87.9 86.8 86.2 66.7 84.2
VLM2Vec(LLaVA-1.6) 7B 76.0 90.6 46.8 66.6 85.8 90.7 84.7 90.8 85.8 66.3 86.5
UniME‡(Phi3.5-V) 4.2B 77.0(+11.3) 88.2(+5.2) 49.8(+6.1) 66.8(+7.0) 92.1(+2.0) 96.4(+4.4) 92.7(+4.8) 95.1(+8.3) 90.1(+3.9) 70.9(+4.2) 93.3(+9.1)
UniME‡(LLaVA-1.6) 7B 81.9(+5.9) 93.4(+2.8) 53.7(+6.1) 70.1(+3.5) 93.9(+8.1) 97.2(+6.5) 95.2(+10.5) 95.9(+5.1) 89.0(+3.2) 71.5(+5.2) 94.4(+7.9)

Qualitative Analysis

Overview Illustration

To evaluate the semantic expressiveness of UniME embeddings, we visualize the top-k token prediction probabilities using the prompt "<Image> Summary above image in one word:". Initially, predictions are abstract (e.g., "Pastoral", "Peaceful"). After textual discriminative knowledge distillation, they become more concrete (e.g., "cow", "waterfront"), though still dominated by "Farm". Following hard negative enhanced instruction tuning, the distribution spreads more evenly across semantically relevant tokens, indicating improved discriminative power and alignment with image content.
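This kind of visualization can be reproduced with a short script. A hedged sketch, assuming the Hugging Face llava-hf/llava-v1.6-vicuna-7b-hf checkpoint, its USER/ASSISTANT conversation template, and a placeholder image path; none of these are confirmed as the exact setup used here:

    import torch
    from PIL import Image
    from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

    model_id = "llava-hf/llava-v1.6-vicuna-7b-hf"     # assumed LLaVA-1.6 checkpoint
    processor = LlavaNextProcessor.from_pretrained(model_id)
    model = LlavaNextForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.float16
    ).to("cuda:0")

    image = Image.open("example.jpg")                 # placeholder image path
    prompt = "USER: <image>\nSummary above image in one word: ASSISTANT:"
    inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda:0", torch.float16)

    with torch.no_grad():
        logits = model(**inputs).logits               # (1, seq_len, vocab_size)

    # Next-token distribution at the final position: the single word the model
    # would emit to summarize the image, i.e. the tokens shown in the figure.
    probs = logits[0, -1].float().softmax(dim=-1)
    top_p, top_ids = probs.topk(10)
    for p, tok_id in zip(top_p.tolist(), top_ids.tolist()):
        print(f"{processor.tokenizer.decode([tok_id]).strip():>12s}  {p:.3f}")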

BibTeX

If you find our work helpful for your research, please consider citing it 📃


      @misc{gu2025breakingmodalitybarrieruniversal,
        title={Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs}, 
        author={Tiancheng Gu and Kaicheng Yang and Ziyong Feng and Xingjun Wang and Yanzhao Zhang and Dingkun Long and Yingda Chen and Weidong Cai and Jiankang Deng},
        year={2025}
      }