Quality and Relevance Metrics for Selection of Multimodal Pretraining Data
Abstract
Self-supervised pretraining has become a strong force in both language and vision tasks. Current efforts to improve the effects of pretraining focus on improving network architecture or defining new tasks to extract representations from the data. We focus on a third axis, the data itself, to quantify and measure how different sources and quality of data can affect the learned representations. As pretraining datasets grow larger and larger, the cost of pretraining will continue to increase. This issue is especially acute for visuolinguistic data, as the cost of storage and processing for image and video data will rise quickly. We therefore examine four visuolinguistic datasets (three preexisting datasets and one collected by us) for their utility as pretraining datasets. We define metrics for dataset quality and relevance, propose a method for subsampling large corpuses for the data most relevant to a set of downstream multimodal vision and language tasks of interest, and show that this method increases performance across the board for all downstream tasks.
Cite
Text
Rao et al. "Quality and Relevance Metrics for Selection of Multimodal Pretraining Data." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020. doi:10.1109/CVPRW50498.2020.00486
Markdown
[Rao et al. "Quality and Relevance Metrics for Selection of Multimodal Pretraining Data." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020.](https://mlanthology.org/cvprw/2020/rao2020cvprw-quality/) doi:10.1109/CVPRW50498.2020.00486
BibTeX
@inproceedings{rao2020cvprw-quality,
title = {{Quality and Relevance Metrics for Selection of Multimodal Pretraining Data}},
author = {Rao, Roshan and Rao, Sudha and Nouri, Elnaz and Dey, Debadeepta and Celikyilmaz, Asli and Dolan, Bill},
booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
year = {2020},
pages = {4109-4116},
doi = {10.1109/CVPRW50498.2020.00486},
url = {https://mlanthology.org/cvprw/2020/rao2020cvprw-quality/}
}