Enhancing Vision-Language Pre-Training with Rich Supervisions
Abstract
We propose Strongly Supervised pre-training with ScreenShots (S4), a novel pre-training paradigm for Vision-Language Models that uses data from large-scale web screenshot rendering. Web screenshots unlock a treasure trove of visual and textual cues that are absent from plain image-text pairs. In S4 we leverage the inherent tree-structured hierarchy of HTML elements and their spatial localization to carefully design 10 pre-training tasks with large-scale annotated data. These tasks resemble downstream tasks across different domains, and the annotations are cheap to obtain. We demonstrate that, compared to current screenshot pre-training objectives, our method significantly enhances the performance of image-to-text models on nine varied and popular downstream tasks, with up to 76.1% improvement on Table Detection and at least 1% on Widget Captioning.
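To make the "cheap supervision from HTML" idea concrete, here is a minimal, hypothetical sketch (not the paper's actual pipeline): parsing a toy page with Python's standard-library `html.parser` and recording each element's tag, tree depth, and visible text. Signals like these, paired with the rendered screenshot's element coordinates, are the kind of tree-structured annotations S4 exploits.

```python
# Hypothetical illustration only: deriving structural annotations from HTML.
# The real S4 pipeline renders pages and pairs annotations with screenshots;
# this sketch just shows that tag hierarchy and text come for free from markup.
from html.parser import HTMLParser

class TreeAnnotator(HTMLParser):
    """Collects (tag, depth, text) triples while walking the HTML tree."""
    def __init__(self):
        super().__init__()
        self.stack = []        # currently open tags; its length is the depth
        self.annotations = []  # [tag, depth, visible text] per element

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.annotations.append([tag, len(self.stack), ""])

    def handle_data(self, data):
        if self.annotations and data.strip():
            self.annotations[-1][2] += data.strip()

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

page = "<div><h1>Title</h1><p>Body text</p></div>"
ann = TreeAnnotator()
ann.feed(page)
# ann.annotations → [['div', 1, ''], ['h1', 2, 'Title'], ['p', 2, 'Body text']]
```

A renderer would additionally attach each element's on-screen bounding box, yielding localized text targets at no labeling cost.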
Cite
Text
Gao et al. "Enhancing Vision-Language Pre-Training with Rich Supervisions." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01280
Markdown
[Gao et al. "Enhancing Vision-Language Pre-Training with Rich Supervisions." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/gao2024cvpr-enhancing/) doi:10.1109/CVPR52733.2024.01280
BibTeX
@inproceedings{gao2024cvpr-enhancing,
title = {{Enhancing Vision-Language Pre-Training with Rich Supervisions}},
author = {Gao, Yuan and Shi, Kunyu and Zhu, Pengkai and Belval, Edouard and Nuriel, Oren and Appalaraju, Srikar and Ghadar, Shabnam and Tu, Zhuowen and Mahadevan, Vijay and Soatto, Stefano},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2024},
pages = {13480-13491},
doi = {10.1109/CVPR52733.2024.01280},
url = {https://mlanthology.org/cvpr/2024/gao2024cvpr-enhancing/}
}