RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm

1The University of Sydney
2DeepGlint
3Imperial College London
* equal contributions

Part1: Introduction

Motivation

Overview Illustration

Open-ended Questions

  • How to utilize multimodal interleaved documents for vision-language representation learning?
  • How to effectively leverage both realistic and synthetic texts to enhance representation performance?

We first establish a Real-World Data Extraction pipeline to extract high-quality images and texts. Then we design a hierarchical retrieval method to efficiently associate each image with multiple semantically relevant texts. To enhance fine-grained image understanding, we propose a visual semantic augmented generation module for synthetic text production. Furthermore, we employ a semantic balance sampling strategy to improve dataset diversity, enabling better learning of long-tail concepts.

RealSyn Dataset

Dataset #Images Avg. Tokens/Image Avg. Texts/Image Text Type Source Type
CC12M 12,000,000 - 1 Real-World Website
YFCC15M 15,000,000 16 1 Real-World Website
CapsFusion 120,000,000 - 1 Synthetic Image-Text pair
LAION400M 400,000,000 27 1 Real-World Website
RealSyn15M 15,250,144 40 4 Real-World & Synthetic Interleaved Image-Text Document
RealSyn30M 30,328,852 38 4 Real-World & Synthetic Interleaved Image-Text Document
RealSyn100M 100,862,786 36 4 Real-World & Synthetic Interleaved Image-Text Document
  • Statistical Comparison: The RealSyn dataset provides four textual descriptions for each image, with an average token length of 36-40, significantly higher than LAION400M and YFCC15M. Furthermore, unlike previous datasets, RealSyn is sourced from real-world interleaved image-text documents and includes both realistic and synthetic texts.
  • Topic-based Assessment: Notably, samples related to flowers and automobiles constitute only 0.4% and 0.9% of the dataset, respectively. Because of the low number of samples, the model cannot adequately learn these concepts, resulting in poor linear probe and zero-shot transfer performance on the Flowers and Cars datasets.
  • Richness Assessment: Both the retrieved real-world sentences and the semantic augmented synthetic captions contain a larger number of words, which provides extensive textual information and improves vision-language pre-training.
  • Diversity Assessment: Both the most relevant retrieved real-world sentence and the semantic augmented synthetic caption in RealSyn contain a larger number of distinct entities. This diversity effectively guides the model toward learning richer knowledge, thereby improving performance and robustness.

Part 2: Methodology

Real-World Data Extraction

Steps

  • Data Extraction: A random sample of 118 million image-text documents is selected from OBELICS. Images are stored in a dedicated image database, and sentences are segmented using NLTK and stored in a separate sentence database.
  • Visual Image Filtration: Low-quality images are removed based on size and aspect ratio. Redundant images are filtered using EVA02-CLIP and the Union-Find algorithm.
  • Textual Knowledge Filtration: Sentences are filtered by removing those containing emojis or URLs, or whose word counts fall outside the 3-81 range. Following the CAT rules, only sentences with at least C1 caption complexity and containing an action are retained (a minimal sketch of this step follows the list).
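
For reference, here is a minimal Python sketch of the sentence segmentation and textual filtration step. It is an illustration under stated assumptions, not the exact pipeline code: the emoji/URL patterns and the helper name extract_sentences are ours, and the CAT-based complexity/action checks are only indicated as a placeholder.

    import re
    import nltk

    nltk.download("punkt", quiet=True)  # sentence tokenizer model (newer NLTK may need "punkt_tab")

    URL_RE = re.compile(r"https?://\S+|www\.\S+")
    EMOJI_RE = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]")  # rough emoji/symbol range

    def extract_sentences(document_text, min_words=3, max_words=81):
        """Segment a document into sentences and drop low-quality ones."""
        kept = []
        for sent in nltk.sent_tokenize(document_text):
            n_words = len(sent.split())
            if not (min_words <= n_words <= max_words):
                continue  # outside the 3-81 word range
            if URL_RE.search(sent) or EMOJI_RE.search(sent):
                continue  # contains a URL or an emoji
            kept.append(sent)
        # The CAT rules (>= C1 caption complexity, contains an action) would be
        # applied here as an additional rule/model-based pass.
        return kept

    print(extract_sentences("Visit www.example.com now. A dog runs across the wet grass."))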

Retrieval and Generation Framework

Key Contributions

  • Hierarchical Retrieval: We encode 0.84B sentences with EVA02-CLIP and cluster them into 2M clusters using K-means. We first perform inter-cluster retrieval to find the most relevant cluster center for each image, then group images sharing the same cluster center and perform intra-cluster retrieval to obtain multiple semantically relevant sentences (a toy retrieval sketch follows this list).
  • Image Semantic Augmented Generation: We employ the OFA model to generate concise captions for images, enhancing fine-grained visual information, and use RAM++, an open-set image tagging model, to extract image tags. We use ChatGPT-4 Turbo to integrate these captions and tags with realistic texts into a dataset of 100,000 instructions, fine-tune LLaMA3-8B on it with LLaMA-Factory, and deploy the fine-tuned model with vLLM for large-scale synthetic text generation (an illustrative inference sketch follows this list).
  • Semantic Balance Sampling: To mitigate the impact of OCR-related or mismatched pairs during pre-training, we exclude 29.7 million pairs whose cosine similarities fall outside the 0.51-0.61 range. We then cluster the image embeddings of the remaining 168.3 million pairs into 1 million clusters. To promote semantic diversity, we randomly select 20, 35, or 180 samples per cluster from larger clusters (one cap per dataset scale) and retain all samples from smaller clusters (a sampling sketch follows this list).
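
To make the two-stage retrieval concrete, here is a toy NumPy sketch of the inter-cluster and intra-cluster steps. The random vectors stand in for EVA02-CLIP embeddings, and the array sizes and top_k value are illustrative assumptions (the real pipeline works with 2M centers, 0.84B sentences, and batched/approximate search).

    import numpy as np

    rng = np.random.default_rng(0)

    def l2_normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    # Toy stand-ins for EVA02-CLIP features.
    centers  = l2_normalize(rng.standard_normal((1000, 512)))   # K-means cluster centers
    sent_emb = l2_normalize(rng.standard_normal((50000, 512)))  # sentence embeddings
    img_emb  = l2_normalize(rng.standard_normal((128, 512)))    # image embeddings
    sent_cid = np.argmax(sent_emb @ centers.T, axis=1)          # cluster id of each sentence

    def retrieve_texts(image_vec, top_k=5):
        # Inter-cluster retrieval: find the most relevant cluster center for the image.
        cid = int(np.argmax(centers @ image_vec))
        # Intra-cluster retrieval: rank only the sentences assigned to that cluster.
        members = np.where(sent_cid == cid)[0]
        sims = sent_emb[members] @ image_vec
        order = np.argsort(-sims)[:top_k]
        return members[order], sims[order]

    ids, sims = retrieve_texts(img_emb[0])
    print(ids, sims)  # indices and cosine similarities of the retrieved sentences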
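
The synthetic text generation step can be served with vLLM roughly as follows. This is a minimal sketch: the model path and the prompt template are hypothetical, and the actual instruction format comes from the 100,000-instruction fine-tuning set described above.

    from vllm import LLM, SamplingParams

    llm = LLM(model="path/to/llama3-8b-realsyn")             # fine-tuned LLaMA3-8B (hypothetical path)
    params = SamplingParams(temperature=0.7, max_tokens=80)

    def build_prompt(caption, tags, realistic_texts):
        """Combine the OFA caption, RAM++ tags, and retrieved realistic texts."""
        return ("Rewrite the following into one fine-grained image description.\n"
                f"Caption: {caption}\nTags: {', '.join(tags)}\n"
                f"Reference texts: {' '.join(realistic_texts)}\nDescription:")

    prompts = [build_prompt("a red vintage car parked on a street",
                            ["car", "street", "vintage"],
                            ["The 1965 Mustang drew a crowd at the village fair."])]
    outputs = llm.generate(prompts, params)
    print(outputs[0].outputs[0].text)                        # one synthetic caption per prompt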
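
The semantic balance sampling step amounts to similarity filtering followed by cluster-size-capped random sampling. Below is a minimal sketch with toy data; the function name and the per-cluster cap argument are illustrative (the 20 / 35 / 180 caps correspond to the released dataset scales), while the 0.51-0.61 similarity bounds follow the filter described above.

    import numpy as np

    rng = np.random.default_rng(0)

    def semantic_balance_sample(sample_ids, cluster_ids, similarities,
                                cap=20, sim_lo=0.51, sim_hi=0.61):
        """Keep at most `cap` samples per cluster after similarity filtering."""
        # 1) Drop OCR-related / mismatched pairs by image-text cosine similarity.
        keep = (similarities >= sim_lo) & (similarities <= sim_hi)
        sample_ids, cluster_ids = sample_ids[keep], cluster_ids[keep]
        # 2) Cap every cluster at `cap` samples; small clusters are kept whole.
        selected = []
        for cid in np.unique(cluster_ids):
            members = sample_ids[cluster_ids == cid]
            if len(members) > cap:
                members = rng.choice(members, size=cap, replace=False)
            selected.extend(members.tolist())
        return np.array(selected)

    ids  = np.arange(10000)
    cids = rng.integers(0, 50, size=10000)    # toy cluster assignments
    sims = rng.uniform(0.4, 0.7, size=10000)  # toy image-text similarities
    print(len(semantic_balance_sample(ids, cids, sims)))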

Part 3: Evaluating RealSyn on downstream benchmarks

We pre-train ViT-B/32 and ViT-B/16 models on LAION and RealSyn at different data scales; RealSyn achieves state-of-the-art results on all downstream tasks.

DataScale Dataset ViT-B/32 (Linear Probe / Zero-Shot Classification / Zero-Shot Robustness) ViT-B/16 (Linear Probe / Zero-Shot Classification / Zero-Shot Robustness)
15M YFCC 64.5 33.6 18.4 67.5 35.6 21.6
LAION 69.8 42.7 27.2 73.3 46.2 31.9
Ours 71.4 47.9 31.5 74.9 51.6 37.1
30M LAION 72.6 48.6 33.6 75.5 51.3 38.1
Ours 73.9 52.1 37.8 77.5 54.9 43.8
100M LAION 74.4 53.9 39.9 77.2 56.1 44.3
Ours 75.8 56.2 42.7 79.5 59.5 50.3

Linear probe evaluation on 20 downstream datasets. Pre-training ViT-B/32 on RealSyn yields a 1.3%-6.9% improvement in average performance.

DataScale Dataset Average IN1k Food101 CIFAR10 CIFAR100 BirdSnap SUN397 Cars Aircraft DTD Pets Caltech Flowers STL10 EuroSAT RESISC45 KITTI Country UCF101 Memes SST2
15M YFCC 64.5 56.7 67.2 90.4 70.8 47.7 66.7 23.8 29.7 62.4 65.7 80.1 90.0 94.7 94.9 79.4 75.4 18.4 70.8 48.6 56.2
LAION 69.8 59.3 71.0 93.3 78.1 41.0 66.3 76.9 43.0 71.2 74.5 87.6 88.2 93.6 95.3 82.9 72.2 13.5 75.4 55.7 57.3
Ours 71.4 64.0 77.1 94.5 78.7 43.4 71.4 64.7 42.7 71.3 79.9 90.0 88.2 96.4 96.2 87.2 72.4 16.7 79.9 55.7 57.7
30M LAION 72.6 64.3 76.1 94.5 80.0 47.4 70.3 82.3 45.9 74.7 80.3 89.8 89.5 95.6 95.5 84.5 72.6 15.2 76.6 56.2 60.0
Ours 73.9 68.5 81.2 95.4 81.8 48.4 74.5 73.4 45.2 74.2 84.1 91.3 90.6 97.2 96.5 89.2 74.5 19.0 82.6 55.0 56.2
100M LAION 74.4 68.3 80.2 95.7 82.5 51.3 73.4 85.3 46.1 75.6 83.2 91.1 92.0 96.9 95.2 85.9 68.4 17.4 80.0 57.3 61.4
Ours 75.8 71.6 84.2 96.3 83.5 54.0 76.2 77.4 47.6 75.6 86.3 92.1 91.7 97.7 96.8 90.6 73.1 21.1 83.7 57.3 58.9

Zero-shot image-text retrieval performance on Flickr30k and MSCOCO. Pre-training CLIP-B/32 on the RealSyn dataset achieves significant improvements on all metrics.

DataScale Dataset MSCOCO (I2T@1 / I2T@5 / I2T@10 / T2I@1 / T2I@5 / T2I@10) Flickr30k (I2T@1 / I2T@5 / I2T@10 / T2I@1 / T2I@5 / T2I@10)
15M YFCC 13.2 32.0 43.1 21.3 45.1 57.0 23.5 47.3 58.3 37.1 64.8 75.9
LAION 17.4 38.3 49.7 28.4 53.0 64.9 33.3 60.5 70.9 49.1 76.8 84.5
Ours 25.8 50.6 62.5 43.8 69.5 79.6 49.5 76.3 84.6 72.9 91.1 95.1
30M LAION 22.1 45.5 57.6 35.9 62.4 73.2 42.4 70.1 79.4 59.6 83.5 89.8
Ours 29.5 55.2 66.9 48.2 74.6 83.0 54.0 80.0 87.6 76.0 93.3 96.9
100M LAION 27.1 52.1 63.8 43.3 68.0 78.1 50.4 77.2 85.5 67.5 87.9 93.0
Ours 32.5 58.9 70.2 52.3 76.7 85.0 58.8 84.1 90.5 81.6 96.1 97.3

Zero-shot transfer on 20 downstream datasets. Pre-training ViT-B/32 on RealSyn yields a 2.3%-14.3% improvement in average performance.

DataScale Dataset Average IN1k Food101 CIFAR10 CIFAR100 BirdSnap SUN397 Cars Aircraft DTD Pets Caltech Flowers STL10 EuroSAT RESISC45 KITTI Country UCF101 Memes SST2
15M YFCC 33.6 32.3 36.3 74.0 40.3 19.4 41.8 2.1 2.3 12.0 19.8 59.8 48.9 87.7 21.2 20.3 23.8 5.1 27.8 47.4 50.1
LAION 42.7 37.1 49.1 85.7 56.9 11.5 45.1 49.9 3.8 25.7 54.6 78.1 30.5 89.5 36.7 36.1 21.7 5.6 38.2 48.8 49.9
Ours 47.9 43.3 60.0 85.7 58.3 10.5 56.4 27.6 5.5 33.2 61.7 80.2 31.2 92.4 56.5 56.2 34.0 8.9 52.6 53.3 51.3
30M LAION 48.6 44.9 58.9 85.9 63.1 17.4 54.8 61.0 4.3 36.4 65.5 82.0 41.3 91.3 40.3 43.7 24.3 7.2 47.4 51.5 50.1
Ours 52.1 50.9 67.5 89.0 65.2 15.0 60.6 39.2 7.9 37.8 70.5 84.0 42.2 93.8 59.9 61.9 27.7 10.6 56.7 52.5 50.1
100M LAION 53.9 52.8 68.9 90.5 68.6 23.6 60.6 68.3 7.8 41.2 74.7 87.1 47.7 94.4 45.6 53.4 23.6 10.4 54.5 51.9 53.3
Ours 56.2 56.2 73.5 89.5 68.8 20.1 65.0 48.5 10.2 46.1 76.7 87.6 48.8 94.4 69.0 65.5 24.6 12.1 60.5 52.4 54.1

Zero-shot robustness comparison. Pre-training CLIP-B/32 on RealSyn demonstrates superior robustness across all datasets.

DataScale Dataset Average INv2 INA INR ObjectNet INS
15M YFCC 18.4 27.3 12.3 20.8 25.3 6.3
LAION 27.2 30.7 6.0 46.5 28.7 24.3
Ours 31.5 37.1 12.5 47.7 35.0 25.4
30M LAION 33.6 37.5 8.9 54.4 35.5 31.8
Ours 37.8 42.9 16.1 56.6 41.5 31.9
100M LAION 39.9 44.6 12.2 62.5 42.2 37.9
Ours 42.7 47.6 19.7 62.5 45.8 37.9

Part 4: Dataset Analysis

  • Model Scaling: Compared to LAION, RealSyn demonstrates steeper slopes in performance curves across linear probing, zero-shot transfer, and robustness, indicative of its superior model scaling capabilities.
  • Data Scaling Law: Performance improves consistently as the pre-training data grows from 15M to 30M to 100M across linear probing, zero-shot transfer, retrieval, and robustness (see Part 3), indicating that RealSyn also scales well with data.

BibTeX

If you find our work helpful for your research, please consider citing it 📃


        @misc{gu2025realsyn,
          title={RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm}, 
          author={Tiancheng Gu and Kaicheng Yang and Chaoyi Zhang and Yin Xie and Xiang An and Ziyong Feng and Dongnan Liu and Weidong Cai and Jiankang Deng},
          year={2025},
          eprint={2502.12513},
          archivePrefix={arXiv},
          primaryClass={cs.CV}
        }