RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm

1The University of Sydney
2DeepGlint
3Imperial College London
* equal contributions

Part1: Introduction

Motivation

Overview Illustration

Open-ended Questions

  • How to utilize multimodal interleaved documents for vision-language representation learning?
  • How to effectively leverage both realistic and synthetic texts to enhance representation performance?

We first establish a Real-World Data Extraction pipeline to extract high-quality images and texts. Then we design a hierarchical retrieval method to efficiently associate each image with multiple semantically relevant texts. To enhance fine-grained image understanding, we propose a visual semantic augmented generation module for synthetic text production. Furthermore, we employ a semantic balance sampling strategy to improve dataset diversity, enabling better learning of long-tail concepts.

RealSyn Dataset

Dataset #Images Avg. Tokens/Image Avg. Texts/Image Text Type Source Type
CC12M 12,000,000 - 1 Real-World Website
YFCC15M 15,000,000 16 1 Real-World Website
CapsFusion 120,000,000 - 1 Synthetic Image-Text pair
LAION400M 400,000,000 27 1 Real-World Website
RealSyn15M 15,250,144 40 4 Real-World & Synthetic Interleaved Image-Text Document
RealSyn30M 30,328,852 38 4 Real-World & Synthetic Interleaved Image-Text Document
RealSyn100M 100,862,786 36 4 Real-World & Synthetic Interleaved Image-Text Document
  • Statistical Comparison: The RealSyn dataset provides four textual descriptions for each image, with an average token length of 36-40, significantly higher than LAION400M and YFCC15M. Furthermore, unlike previous datasets, RealSyn is sourced from real-world interleaved image-text documents and includes both realistic and synthetic texts.
  • Topic-based Assessment: Notably, samples related to flowers and automobiles constitute only 0.4% and 0.9% of the dataset, respectively. Because of the low number of samples, the model cannot adequately learn these concepts, resulting in poor linear probe and zero-shot transfer performance on the Flowers and Cars datasets.
  • Richness Assessment: Both the retrieved real-world sentences and the semantic augmented synthetic captions contain a larger number of words, which provides extensive textual information and improves vision-language pre-training.
  • Diversity Assessment: Both the most relevant retrieved real-world sentence and the semantic augmented synthetic caption in RealSyn contain a larger number of distinct entities. This diversity effectively guides the model toward learning richer knowledge, thereby improving performance and robustness.

Part 2: Methodology

Real-World Data Extraction

Steps

  • Data Extraction: A random sample of 118 million image-text documents is selected from OBELICS. Images are stored in a dedicated image database, and sentences are segmented using NLTK and stored in a separate sentence database.
  • Visual Image Filtration: Low-quality images are removed based on size and aspect ratio. Redundant images are filtered using EVA02-CLIP and the Union-Find algorithm.
  • Textual Knowledge Filtration: Sentences are filtered by removing those containing emojis or URLs, or whose word counts fall outside the 3-81 range. Following the CAT rules, only sentences with at least C1 caption complexity and containing an action are retained (a minimal sketch of this step follows the list).
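
For reference, here is a minimal Python sketch of the sentence segmentation and textual filtration step. It is an illustration under stated assumptions, not the exact pipeline code: the emoji/URL patterns and the helper name extract_sentences are ours, and the CAT-based complexity/action checks are only indicated as a placeholder.

    import re
    import nltk

    nltk.download("punkt", quiet=True)  # sentence tokenizer model (newer NLTK may need "punkt_tab")

    URL_RE = re.compile(r"https?://\S+|www\.\S+")
    EMOJI_RE = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]")  # rough emoji/symbol range

    def extract_sentences(document_text, min_words=3, max_words=81):
        """Segment a document into sentences and drop low-quality ones."""
        kept = []
        for sent in nltk.sent_tokenize(document_text):
            n_words = len(sent.split())
            if not (min_words <= n_words <= max_words):
                continue  # outside the 3-81 word range
            if URL_RE.search(sent) or EMOJI_RE.search(sent):
                continue  # contains a URL or an emoji
            kept.append(sent)
        # The CAT rules (>= C1 caption complexity, contains an action) would be
        # applied here as an additional rule/model-based pass.
        return kept

    print(extract_sentences("Visit www.example.com now. A dog runs across the wet grass."))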

Retrieval and Generation Framework

Key Contributions

  • Hierarchical Retrieval: We encode 0.84B sentences with EVA02-CLIP and cluster them into 2M clusters using K-means. We first perform inter-cluster retrieval to find the most relevant cluster center for each image, then group images sharing the same cluster center and perform intra-cluster retrieval to obtain multiple semantically relevant sentences (a toy retrieval sketch follows this list).
  • Image Semantic Augmented Generation: We employ the OFA model to generate concise captions for images, enhancing fine-grained visual information, and use RAM++, an open-set image tagging model, to extract image tags. We use ChatGPT-4 Turbo to integrate these captions and tags with realistic texts into a dataset of 100,000 instructions, fine-tune LLaMA3-8B on it with LLaMA-Factory, and deploy the fine-tuned model with vLLM for large-scale synthetic text generation (an illustrative inference sketch follows this list).
  • Semantic Balance Sampling: To mitigate the impact of OCR-related or mismatched pairs during pre-training, we exclude 29.7 million pairs whose cosine similarities fall outside the 0.51-0.61 range. We then cluster the image embeddings of the remaining 168.3 million pairs into 1 million clusters. To promote semantic diversity, we randomly select 20, 35, or 180 samples per cluster from larger clusters (one cap per dataset scale) and retain all samples from smaller clusters (a sampling sketch follows this list).
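
To make the two-stage retrieval concrete, here is a toy NumPy sketch of the inter-cluster and intra-cluster steps. The random vectors stand in for EVA02-CLIP embeddings, and the array sizes and top_k value are illustrative assumptions (the real pipeline works with 2M centers, 0.84B sentences, and batched/approximate search).

    import numpy as np

    rng = np.random.default_rng(0)

    def l2_normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    # Toy stand-ins for EVA02-CLIP features.
    centers  = l2_normalize(rng.standard_normal((1000, 512)))   # K-means cluster centers
    sent_emb = l2_normalize(rng.standard_normal((50000, 512)))  # sentence embeddings
    img_emb  = l2_normalize(rng.standard_normal((128, 512)))    # image embeddings
    sent_cid = np.argmax(sent_emb @ centers.T, axis=1)          # cluster id of each sentence

    def retrieve_texts(image_vec, top_k=5):
        # Inter-cluster retrieval: find the most relevant cluster center for the image.
        cid = int(np.argmax(centers @ image_vec))
        # Intra-cluster retrieval: rank only the sentences assigned to that cluster.
        members = np.where(sent_cid == cid)[0]
        sims = sent_emb[members] @ image_vec
        order = np.argsort(-sims)[:top_k]
        return members[order], sims[order]

    ids, sims = retrieve_texts(img_emb[0])
    print(ids, sims)  # indices and cosine similarities of the retrieved sentences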
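
The synthetic text generation step can be served with vLLM roughly as follows. This is a minimal sketch: the model path and the prompt template are hypothetical, and the actual instruction format comes from the 100,000-instruction fine-tuning set described above.

    from vllm import LLM, SamplingParams

    llm = LLM(model="path/to/llama3-8b-realsyn")             # fine-tuned LLaMA3-8B (hypothetical path)
    params = SamplingParams(temperature=0.7, max_tokens=80)

    def build_prompt(caption, tags, realistic_texts):
        """Combine the OFA caption, RAM++ tags, and retrieved realistic texts."""
        return ("Rewrite the following into one fine-grained image description.\n"
                f"Caption: {caption}\nTags: {', '.join(tags)}\n"
                f"Reference texts: {' '.join(realistic_texts)}\nDescription:")

    prompts = [build_prompt("a red vintage car parked on a street",
                            ["car", "street", "vintage"],
                            ["The 1965 Mustang drew a crowd at the village fair."])]
    outputs = llm.generate(prompts, params)
    print(outputs[0].outputs[0].text)                        # one synthetic caption per prompt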
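
The semantic balance sampling step amounts to similarity filtering followed by cluster-size-capped random sampling. Below is a minimal sketch with toy data; the function name and the per-cluster cap argument are illustrative (the 20 / 35 / 180 caps correspond to the released dataset scales), while the 0.51-0.61 similarity bounds follow the filter described above.

    import numpy as np

    rng = np.random.default_rng(0)

    def semantic_balance_sample(sample_ids, cluster_ids, similarities,
                                cap=20, sim_lo=0.51, sim_hi=0.61):
        """Keep at most `cap` samples per cluster after similarity filtering."""
        # 1) Drop OCR-related / mismatched pairs by image-text cosine similarity.
        keep = (similarities >= sim_lo) & (similarities <= sim_hi)
        sample_ids, cluster_ids = sample_ids[keep], cluster_ids[keep]
        # 2) Cap every cluster at `cap` samples; small clusters are kept whole.
        selected = []
        for cid in np.unique(cluster_ids):
            members = sample_ids[cluster_ids == cid]
            if len(members) > cap:
                members = rng.choice(members, size=cap, replace=False)
            selected.extend(members.tolist())
        return np.array(selected)

    ids  = np.arange(10000)
    cids = rng.integers(0, 50, size=10000)    # toy cluster assignments
    sims = rng.uniform(0.4, 0.7, size=10000)  # toy image-text similarities
    print(len(semantic_balance_sample(ids, cids, sims)))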

Part 3: Evaluating RealSyn on downstream benchmarks

We pre-train ViT-B/32 and ViT-B/16 models on LAION and RealSyn at different data scales; RealSyn achieves state-of-the-art results on all downstream tasks.

DataScale Dataset ViT-B/32 (Linear Probe / Zero-Shot Classification / Zero-Shot Robustness) ViT-B/16 (Linear Probe / Zero-Shot Classification / Zero-Shot Robustness)
15M YFCC 64.5 33.6 18.4 67.5 35.6 21.6
LAION 69.8 42.7 27.2 73.3 46.2 31.9
Ours 71.4 47.9 31.5 74.9 51.6 37.1
30M LAION 72.6 48.6 33.6 75.5 51.3 38.1
Ours 73.9 52.1 37.8 77.5 54.9 43.8
100M LAION 74.4 53.9 39.9 77.2 56.1 44.3
Ours 75.8 56.2 42.7 79.5 59.5 50.3

Linear probe evaluation on 20 downstream datasets. Pre-training ViT-B/32 on RealSyn yields a 1.3%-6.9% improvement in average performance.

DataScale Dataset Average IN1k Food101 CIFAR10 CIFAR100 BirdSnap SUN397 Cars Aircraft DTD Pets Caltech Flowers STL10 EuroSAT RESISC45 KITTI Country UCF101 Memes SST2
15M YFCC 64.5 56.7 67.2 90.4 70.8 47.7 66.7 23.8 29.7 62.4 65.7 80.1 90.0 94.7 94.9 79.4 75.4 18.4 70.8 48.6 56.2
LAION 69.8 59.3 71.0 93.3 78.1 41.0 66.3 76.9 43.0 71.2 74.5 87.6 88.2 93.6 95.3 82.9 72.2 13.5 75.4 55.7 57.3
Ours 71.4 64.0 77.1 94.5 78.7 43.4 71.4 64.7 42.7 71.3 79.9 90.0 88.2 96.4 96.2 87.2 72.4 16.7 79.9 55.7 57.7
30M LAION 72.6 64.3 76.1 94.5 80.0 47.4 70.3 82.3 45.9 74.7 80.3 89.8 89.5 95.6 95.5 84.5 72.6 15.2 76.6 56.2 60.0
Ours 73.9 68.5 81.2 95.4 81.8 48.4 74.5 73.4 45.2 74.2 84.1 91.3 90.6 97.2 96.5 89.2 74.5 19.0 82.6 55.0 56.2
100M LAION 74.4 68.3 80.2 95.7 82.5 51.3 73.4 85.3 46.1 75.6 83.2 91.1 92.0 96.9 95.2 85.9 68.4 17.4 80.0 57.3 61.4
Ours 75.8 71.6 84.2 96.3 83.5 54.0 76.2 77.4 47.6 75.6 86.3 92.1 91.7 97.7 96.8 90.6 73.1 21.1 83.7 57.3 58.9

Zero-shot image-text retrieval performance on Flickr30k and MSCOCO. Pre-training CLIP-B/32 on the RealSyn dataset achieves significant improvements on all metrics.

DataScale Dataset MSCOCO (I2T@1 / I2T@5 / I2T@10 / T2I@1 / T2I@5 / T2I@10) Flickr30k (I2T@1 / I2T@5 / I2T@10 / T2I@1 / T2I@5 / T2I@10)
15M YFCC 13.2 32.0 43.1 21.3 45.1 57.0 23.5 47.3 58.3 37.1 64.8 75.9
LAION 17.4 38.3 49.7 28.4 53.0 64.9 33.3 60.5 70.9 49.1 76.8 84.5
Ours 25.8 50.6 62.5 43.8 69.5 79.6 49.5 76.3 84.6 72.9 91.1 95.1
30M LAION 22.1 45.5 57.6 35.9 62.4 73.2 42.4 70.1 79.4 59.6 83.5 89.8
Ours 29.5 55.2 66.9 48.2 74.6 83.0 54.0 80.0 87.6 76.0 93.3 96.9
100M LAION 27.1 52.1 63.8 43.3 68.0 78.1 50.4 77.2 85.5 67.5 87.9 93.0
Ours 32.5 58.9 70.2 52.3 76.7 85.0 58.8 84.1 90.5 81.6 96.1 97.3

Zero-shot transfer on 20 downstream datasets. Pre-training ViT-B/32 on RealSyn yields a 2.3%-14.3% improvement in average performance.

DataScale Dataset Average IN1k Food101 CIFAR10 CIFAR100 BirdSnap SUN397 Cars Aircraft DTD Pets Caltech Flowers STL10 EuroSAT RESISC45 KITTI Country UCF101 Memes SST2
15M YFCC 33.6 32.3 36.3 74.0 40.3 19.4 41.8 2.1 2.3 12.0 19.8 59.8 48.9 87.7 21.2 20.3 23.8 5.1 27.8 47.4 50.1
LAION 42.7 37.1 49.1 85.7 56.9 11.5 45.1 49.9 3.8 25.7 54.6 78.1 30.5 89.5 36.7 36.1 21.7 5.6 38.2 48.8 49.9
Ours 47.9 43.3 60.0 85.7 58.3 10.5 56.4 27.6 5.5 33.2 61.7 80.2 31.2 92.4 56.5 56.2 34.0 8.9 52.6 53.3 51.3
30M LAION 48.6 44.9 58.9 85.9 63.1 17.4 54.8 61.0 4.3 36.4 65.5 82.0 41.3 91.3 40.3 43.7 24.3 7.2 47.4 51.5 50.1
Ours 52.1 50.9 67.5 89.0 65.2 15.0 60.6 39.2 7.9 37.8 70.5 84.0 42.2 93.8 59.9 61.9 27.7 10.6 56.7 52.5 50.1
100M LAION 53.9 52.8 68.9 90.5 68.6 23.6 60.6 68.3 7.8 41.2 74.7 87.1 47.7 94.4 45.6 53.4 23.6 10.4 54.5 51.9 53.3
Ours 56.2 56.2 73.5 89.5 68.8 20.1 65.0 48.5 10.2 46.1 76.7 87.6 48.8 94.4 69.0 65.5 24.6 12.1 60.5 52.4 54.1

Zero-shot robustness comparison. Pre-training CLIP-B/32 on RealSyn demonstrates superior robustness across all datasets.

DataScale Dataset Average INv2 INA INR ObjectNet INS
15M YFCC 18.4 27.3 12.3 20.8 25.3 6.3
LAION 27.2 30.7 6.0 46.5 28.7 24.3
Ours 31.5 37.1 12.5 47.7 35.0 25.4
30M LAION 33.6 37.5 8.9 54.4 35.5 31.8
Ours 37.8 42.9 16.1 56.6 41.5 31.9
100M LAION 39.9 44.6 12.2 62.5 42.2 37.9
Ours 42.7 47.6 19.7 62.5 45.8 37.9

Part 4: Dataset Analysis

  • Model Scaling: Compared to LAION, RealSyn demonstrates steeper slopes in performance curves across linear probing, zero-shot transfer, and robustness, indicative of its superior model scaling capabilities.
  • Data Scaling Law: Performance improves consistently as the pre-training data grows from 15M to 30M to 100M across linear probing, zero-shot transfer, retrieval, and robustness (see Part 3), indicating that RealSyn also scales well with data.

BibTeX

If you find our work helpful for your research, please consider citing it 📃


        @misc{gu2025realsyn,
          title={RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm}, 
          author={Tiancheng Gu and Kaicheng Yang and Chaoyi Zhang and Yin Xie and Xiang An and Ziyong Feng and Dongnan Liu and Weidong Cai and Jiankang Deng},
          year={2025},
          eprint={2502.12513},
          archivePrefix={arXiv},
          primaryClass={cs.CV}
        }