VLUE: A Multi-Task Multi-Dimension Benchmark for Evaluating Vision-Language Pre-Training
Abstract
Recent advances in vision-language pre-training (VLP) have demonstrated impressive performance on a range of vision-language (VL) tasks. However, several challenges remain in measuring the community’s progress toward building general multi-modal intelligence. First, most downstream VL datasets are annotated using raw images that are already seen during pre-training, which may result in an overestimation of current VLP models’ generalization ability. Second, recent VLP work mainly focuses on absolute performance but overlooks the efficiency-performance trade-off, which is also an important indicator for measuring progress. To this end, we introduce the Vision-Language Understanding Evaluation (VLUE) benchmark, a multi-task multi-dimension benchmark for evaluating the generalization capabilities and the efficiency-performance trade-off (“Pareto SOTA”) of VLP models. We demonstrate that there is a sizable generalization gap for all VLP models when tested on out-of-distribution test sets annotated on images from a more diverse distribution that spans different cultures. Moreover, we find that measuring the efficiency-performance trade-off of VLP models leads to complementary insights for several design choices of VLP. We release the VLUE benchmark to promote research on building vision-language models that generalize well to images unseen during pre-training and are practical in terms of the efficiency-performance trade-off.
Cite
Text
Zhou et al. "VLUE: A Multi-Task Multi-Dimension Benchmark for Evaluating Vision-Language Pre-Training." International Conference on Machine Learning, 2022.
Markdown
[Zhou et al. "VLUE: A Multi-Task Multi-Dimension Benchmark for Evaluating Vision-Language Pre-Training." International Conference on Machine Learning, 2022.](https://mlanthology.org/icml/2022/zhou2022icml-vlue/)
BibTeX
@inproceedings{zhou2022icml-vlue,
title = {{VLUE: A Multi-Task Multi-Dimension Benchmark for Evaluating Vision-Language Pre-Training}},
author = {Zhou, Wangchunshu and Zeng, Yan and Diao, Shizhe and Zhang, Xinsong},
booktitle = {International Conference on Machine Learning},
year = {2022},
pages = {27395--27411},
volume = {162},
url = {https://mlanthology.org/icml/2022/zhou2022icml-vlue/}
}