Too Large; Data Reduction for Vision-Language Pre-Training
Abstract
This paper examines the problems of severe image-text misalignment and high redundancy in the widely used large-scale Vision-Language Pre-Training (VLP) datasets. To address these issues, we propose an efficient and straightforward Vision-Language learning algorithm called TL;DR, which aims to compress the existing large VLP data into a small, high-quality set. Our approach consists of two major steps. First, a codebook-based encoder-decoder captioner is developed to select representative samples. Second, a new caption is generated to complement the original caption of each selected sample, mitigating the text-image misalignment problem while maintaining uniqueness. As a result, TL;DR reduces the large dataset to a small set of high-quality data, which can serve as an alternative pre-training dataset and significantly speeds up the time-consuming pre-training process. Specifically, TL;DR can compress the mainstream VLP datasets at a high ratio, e.g., reducing the well-cleaned CC3M dataset from 2.8M to 0.67M (~24%) and the noisy YFCC15M from 15M to 2.5M (~16.7%). Extensive experiments with three popular VLP models over seven downstream tasks show that VLP models trained on the compressed dataset provided by TL;DR achieve similar or even better results compared with training on the full-scale dataset.
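To make the two-step pipeline in the abstract concrete, below is a minimal sketch of codebook-style representative-sample selection via clustering. This is an assumption-laden illustration, not the authors' method: the random embeddings, codebook size, per-code quota, and the `captioner` placeholder are all hypothetical stand-ins for the paper's codebook-based encoder-decoder captioner.

# Minimal sketch of the two-step idea: (1) select representative samples
# per codebook entry, (2) generate a complementary caption for each.
# All components here are hypothetical stand-ins, not the paper's code.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in for joint image-text embeddings from a pre-trained encoder
# (assumed shape: one 256-d vector per image-text pair).
embeddings = rng.normal(size=(10_000, 256)).astype(np.float32)

num_codes = 512      # codebook size (assumed)
keep_per_code = 4    # samples retained per code (assumed)

# Step 1: learn a codebook over the embedding space and keep, for each
# code, the samples closest to its centroid -- the most representative ones.
kmeans = KMeans(n_clusters=num_codes, n_init=10, random_state=0)
codes = kmeans.fit_predict(embeddings)

selected = []
for c in range(num_codes):
    members = np.flatnonzero(codes == c)
    if members.size == 0:
        continue
    dists = np.linalg.norm(
        embeddings[members] - kmeans.cluster_centers_[c], axis=1
    )
    selected.extend(members[np.argsort(dists)[:keep_per_code]].tolist())

print(f"kept {len(selected)} of {len(embeddings)} samples "
      f"({len(selected) / len(embeddings):.1%})")

# Step 2 (schematic): for each kept sample, a captioning model generates a
# new caption that complements the original, noisy one. `captioner` is a
# hypothetical callable standing in for the paper's decoder:
#   new_caption = captioner(image)
#   final_text  = original_caption + " " + new_caption

With the assumed numbers above, the selection step alone keeps roughly 20% of the data, which is in the same ballpark as the compression ratios reported in the abstract; the actual ratio in the paper depends on its codebook design and selection criteria.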
Cite
Text
Wang et al. "Too Large; Data Reduction for Vision-Language Pre-Training." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.00292

Markdown
[Wang et al. "Too Large; Data Reduction for Vision-Language Pre-Training." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/wang2023iccv-too/) doi:10.1109/ICCV51070.2023.00292

BibTeX
@inproceedings{wang2023iccv-too,
title = {{Too Large; Data Reduction for Vision-Language Pre-Training}},
author = {Wang, Alex Jinpeng and Lin, Kevin Qinghong and Zhang, David Junhao and Lei, Stan Weixian and Shou, Mike Zheng},
booktitle = {International Conference on Computer Vision},
year = {2023},
  pages = {3147--3157},
doi = {10.1109/ICCV51070.2023.00292},
url = {https://mlanthology.org/iccv/2023/wang2023iccv-too/}
}