Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs

1The University of Sydney
2DeepGlint
3Tongyi Lab, Alibaba Group
4Imperial College London
* Equal contribution † Corresponding author

Introduction

Overview Illustration

The CLIP framework excels at multimodal representation learning but suffers from three limitations: (1) text token truncation, which discards input beyond the text encoder's length limit; (2) isolated image-text encoding, since images and text are embedded by separate encoders with no cross-modal interaction; (3) weak compositionality. While Multimodal Large Language Models (MLLMs) show strong vision-language understanding, their potential for transferable representations remains underutilized. We propose UniME (Universal Multimodal Embedding), a two-stage framework that enhances MLLMs for discriminative representation learning:

Key Components

  1. Textual discriminative knowledge distillation from an LLM-based teacher to improve the MLLM's language embeddings
  2. Hard negative-enhanced instruction tuning, which mitigates false negatives and samples challenging negatives per batch

Experiments on MMEB (Massive Multimodal Embedding Benchmark) and diverse retrieval tasks (short/long caption + compositional) show that UniME achieves consistent improvements, demonstrating superior discriminative and compositional capabilities.

Methodology

Textual Discriminative Knowledge Distillation

To enhance the MLLM's embedding capability, we propose textual knowledge distillation from NV-Embed V2 (a strong LLM-based embedding model). The training process involves decoupling the MLLM's LLM component and processing text with the prompt "Summarize the above sentences in one word.", followed by aligning the student (MLLM) and teacher (NV-Embed V2) embeddings via KL divergence on batch-wise similarity distributions. Notably, only the LLM component is fine-tuned during this process, while all other parameters remain frozen. During inference, the vision encoder is reintegrated for multimodal processing: unimodal inputs (text or image) use modality-specific prompts, whereas interleaved inputs generate the final representation by summing text and image embeddings.
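A minimal sketch of the batch-wise similarity distillation objective, assuming PyTorch; the temperature value and the use of cosine similarity with a row-wise softmax are illustrative choices, not details taken from the paper:

    import torch
    import torch.nn.functional as F

    def text_distillation_loss(student_emb, teacher_emb, tau=0.02):
        """KL divergence between batch-wise similarity distributions.

        student_emb: (B, D_s) text embeddings from the MLLM's decoupled LLM component.
        teacher_emb: (B, D_t) embeddings of the same texts from NV-Embed V2 (frozen).
        The embedding dimensions need not match: the loss compares B x B
        similarity matrices, not the embeddings themselves.
        """
        s = F.normalize(student_emb, dim=-1)
        t = F.normalize(teacher_emb, dim=-1)

        # Row i holds the similarity of sample i to every sample in the batch.
        sim_student = s @ s.T / tau
        sim_teacher = t @ t.T / tau

        # Push the student's similarity distribution toward the teacher's.
        log_p_student = F.log_softmax(sim_student, dim=-1)
        p_teacher = F.softmax(sim_teacher, dim=-1)
        return F.kl_div(log_p_student, p_teacher, reduction="batchmean")

Only the student side receives gradients; the teacher embeddings are treated as constants, consistent with keeping NV-Embed V2 frozen.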

Hard Negative Enhanced Instruction Tuning

This stage further improves the MLLM's visual sensitivity, cross-modal alignment, and instruction-following ability. It rests on two mechanisms: a false negative filter that discards misleading in-batch candidates whose similarity to the query exceeds a threshold α = cos(e_q, e_c⁺) + β, where e_q is the query embedding, e_c⁺ is the positive candidate embedding, and β is a small margin; and an automatic hard negative sampler that keeps the top-k most similar non-matching candidates to raise training difficulty. Each batch therefore passes through three steps: candidates with similarity above α are filtered out as false negatives, the k hardest remaining negatives per query are selected, and a contrastive loss is applied only to these carefully curated difficult cases.
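A minimal sketch of this in-batch filtering and sampling procedure, assuming PyTorch; the margin β, negative count k, and temperature τ are placeholder values rather than the paper's settings:

    import torch
    import torch.nn.functional as F

    def hard_negative_contrastive_loss(q_emb, c_emb, beta=0.1, k=8, tau=0.05):
        """Contrastive loss over a filtered set of hard in-batch negatives.

        q_emb: (B, D) query embeddings; c_emb: (B, D) candidate embeddings,
        where c_emb[i] is the positive candidate for q_emb[i]. Assumes B > k.
        """
        q = F.normalize(q_emb, dim=-1)
        c = F.normalize(c_emb, dim=-1)
        sim = q @ c.T                                   # (B, B) cosine similarities
        pos = sim.diagonal()                            # cos(e_q, e_c+) per query

        # 1) False negative filtering: any candidate whose similarity exceeds
        #    alpha = cos(e_q, e_c+) + beta is treated as a false negative.
        alpha = pos.unsqueeze(1) + beta                 # per-query threshold, (B, 1)
        is_false_neg = sim > alpha
        is_positive = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
        neg_sim = sim.masked_fill(is_false_neg | is_positive, float("-inf"))

        # 2) Hard negative sampling: keep the k most similar remaining negatives.
        hard_neg, _ = neg_sim.topk(k, dim=1)            # (B, k)

        # 3) Contrastive loss over the positive and its k hard negatives only.
        logits = torch.cat([pos.unsqueeze(1), hard_neg], dim=1) / tau
        labels = torch.zeros(sim.size(0), dtype=torch.long, device=sim.device)
        return F.cross_entropy(logits, labels)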

Key Benefit:

Delivers enhanced model discrimination capabilities while preserving training efficiency through intelligent sample selection.

Main Results

Results on the MMEB (Massive Multimodal Embedding Benchmark). IND denotes the in-distribution datasets and OOD the out-of-distribution datasets. The reported scores are the average Precision@1 over the corresponding datasets, with the best results marked in bold. UniME†: UniME with textual discriminative knowledge distillation only. UniME‡: UniME with both textual discriminative knowledge distillation and hard negative enhanced instruction tuning.

Model  #Params  |  Per Meta-Task Score: Classification  VQA  Retrieval  Grounding  |  Average Score: IND  OOD  Overall
# of datasets →  |  10  10  12  4  |  20  16  36
Zero-shot on MMEB
CLIP(ViT-L) 0.4B 42.8 9.1 53.0 51.8 37.1 38.7 39.2
OpenCLIP(ViT-L) 0.4B 41.5 6.9 44.6 53.5 32.8 36.0 36.6
Magiclens(ViT-L) 0.4B 38.8 8.3 35.4 26.0 31.0 23.7 27.1
SigLIP(So/14) 0.9B 40.3 8.4 31.6 59.5 32.3 38.0 35.0
BLIP2(ViT-L) 1.4B 27.0 4.2 33.9 47.0 25.3 25.1 28.0
CLIP(ViT-BigG/14) 2.5B 52.3 14.0 50.5 60.3 38.9 45.8 44.3
EVA-CLIP 8B 56.0 10.4 49.2 58.9 38.1 45.6 43.7
E5-V(Phi3.5-V) 4.2B 39.1 9.6 38.0 57.6 33.1 31.9 36.1
E5-V(LLaVA-1.6) 7B 39.7 10.8 39.4 60.2 34.2 33.4 37.5
UniME†(Phi3.5-V) 4.2B 42.5(+3.4) 18.3(+8.7) 40.5(+2.5) 59.9(+2.3) 36.0(+2.9) 38.3(+6.4) 40.3(+4.2)
UniME†(LLaVA-1.6) 7B 43.0(+3.3) 17.7(+6.9) 42.5(+3.1) 63.2(+3.0) 37.6(+3.4) 38.6(+5.2) 41.6(+4.1)
Fine-tuning on MMEB
CLIP(ViT-L) 0.4B 55.2 19.7 53.2 62.2 47.6 42.8 47.6
VLM2Vec(Phi3.5-V) 4.2B 54.8 54.9 62.3 79.5 66.5 52.0 62.9
VLM2Vec(LLaVA-1.6) 7B 56.8 50.4 63.3 82.6 64.9 53.6 63.3
UniME‡(Phi3.5-V) 4.2B 54.8(+0.0) 55.9(+1.0) 64.5(+2.2) 81.8(+2.3) 68.2(+1.7) 52.7(+0.7) 64.2(+1.3)
UniME‡(LLaVA-1.6) 7B 60.6(+3.8) 52.9(+2.5) 67.9(+4.6) 85.1(+2.5) 68.4(+3.5) 57.9(+4.0) 66.6(+3.3)

Results of zero-shot text-image retrieval on short-caption datasets (Flickr30K and MS-COCO), long-caption datasets (ShareGPT4V and Urban1K), and the compositional benchmark SugarCrepe. The reported scores are the average Recall@1 over the corresponding datasets, with the best results marked in bold. UniME†: UniME with textual discriminative knowledge distillation only. UniME‡: UniME with both textual discriminative knowledge distillation and hard negative enhanced instruction tuning.

Model  #Params  |  Flickr30K: qi→ct  qt→ci  |  COCO: qi→ct  qt→ci  |  ShareGPT4V: qi→ct  qt→ci  |  Urban1K: qi→ct  qt→ci  |  SugarCrepe: Replace  Swap  Add
OpenCLIP(ViT-L) 0.4B 67.3 87.2 37.0 58.1 81.8 84.0 47.0 47.0 79.5 62.7 74.9
CLIP(ViT-BigG/14) 2.5B 79.5 92.9 51.3 67.3 90.1 93.6 77.8 80.7 86.5 68.9 88.4
EVA-CLIP 8B 80.3 94.5 52.0 70.1 93.1 91.2 80.4 77.8 85.9 70.3 86.7
E5-V(Phi3.5-V) 4.2B 72.2 79.6 44.7 53.4 86.0 88.5 83.8 83.6 88.2 66.6 75.3
E5-V(LLaVA-1.6) 7B 77.3 85.7 49.1 57.6 85.1 82.1 88.9 83.2 86.3 68.7 66.9
UniME†(Phi3.5-V) 4.2B 72.0(-0.2) 80.6(+1.0) 44.9(+0.2) 57.2(+0.8) 86.8(+3.8) 92.3(+1.3) 85.1(+2.3) 86.9(+3.3) 90.2(+2.0) 67.6(+1.0) 91.2(+15.9)
UniME†(LLaVA-1.6) 7B 77.2(-0.1) 84.6(-1.1) 51.0(+1.9) 56.4(-1.2) 89.8(+4.7) 86.9(+4.8) 91.3(+2.4) 82.4(-0.8) 89.5(+3.2) 64.8(-3.9) 94.2(+27.3)
VLM2Vec(Phi3.5-V) 4.2B 68.7 83.0 43.7 59.8 90.1 92.0 87.9 86.8 86.2 66.7 84.2
VLM2Vec(LLaVA-1.6) 7B 76.0 90.6 46.8 66.6 85.8 90.7 84.7 90.8 85.8 66.3 86.5
UniME‡(Phi3.5-V) 4.2B 77.0(+11.3) 88.2(+5.2) 49.8(+6.1) 66.8(+7.0) 92.1(+2.0) 96.4(+4.4) 92.7(+4.8) 95.1(+8.3) 90.1(+3.9) 70.9(+4.2) 93.3(+9.1)
UniME‡(LLaVA-1.6) 7B 81.9(+5.9) 93.4(+2.8) 53.7(+6.1) 70.1(+3.5) 93.9(+8.1) 97.2(+6.5) 95.2(+10.5) 95.9(+5.1) 89.0(+3.2) 71.5(+5.2) 94.4(+7.9)

Qualitative Analysis

Overview Illustration

To evaluate the semantic expressiveness of UniME embeddings, we visualize the top-k token prediction probabilities using the prompt "<Image> Summary above image in one word:". Initially, predictions are abstract (e.g., "Pastoral", "Peaceful"). After textual discriminative knowledge distillation, they become more concrete (e.g., "cow", "waterfront"), though still dominated by "Farm". Following hard negative enhanced instruction tuning, the distribution spreads more evenly across semantically relevant tokens, indicating improved discriminative power and alignment with image content.
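This kind of visualization can be reproduced with a short script. A hedged sketch, assuming the Hugging Face llava-hf/llava-v1.6-vicuna-7b-hf checkpoint, its USER/ASSISTANT conversation template, and a placeholder image path; none of these are confirmed as the exact setup used here:

    import torch
    from PIL import Image
    from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

    model_id = "llava-hf/llava-v1.6-vicuna-7b-hf"     # assumed LLaVA-1.6 checkpoint
    processor = LlavaNextProcessor.from_pretrained(model_id)
    model = LlavaNextForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.float16
    ).to("cuda:0")

    image = Image.open("example.jpg")                 # placeholder image path
    prompt = "USER: <image>\nSummary above image in one word: ASSISTANT:"
    inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda:0", torch.float16)

    with torch.no_grad():
        logits = model(**inputs).logits               # (1, seq_len, vocab_size)

    # Next-token distribution at the final position: the single word the model
    # would emit to summarize the image, i.e. the tokens shown in the figure.
    probs = logits[0, -1].float().softmax(dim=-1)
    top_p, top_ids = probs.topk(10)
    for p, tok_id in zip(top_p.tolist(), top_ids.tolist()):
        print(f"{processor.tokenizer.decode([tok_id]).strip():>12s}  {p:.3f}")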

BibTeX

If you find our work helpful for your research, please consider citing it 📃


      @misc{gu2025breakingmodalitybarrieruniversal,
        title={Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs}, 
        author={Tiancheng Gu and Kaicheng Yang and Ziyong Feng and Xingjun Wang and Yanzhao Zhang and Dingkun Long and Yingda Chen and Weidong Cai and Jiankang Deng},
        year={2025}
      }