Robustness Analysis of Video-Language Models Against Visual and Language Perturbations
Abstract
Joint visual and language modeling on large-scale datasets has recently shown promising progress in multi-modal tasks compared to single-modality learning. However, the robustness of these approaches against real-world perturbations has not been studied. In this work, we perform the first extensive robustness study of video-language models against various real-world perturbations. We focus on text-to-video retrieval and propose two large-scale benchmark datasets, MSRVTT-P and YouCook2-P, which utilize 90 different visual and 35 different text perturbations. The study reveals some interesting initial findings about the studied models: 1) models are more robust when text is perturbed than when video is perturbed, 2) models that are pre-trained are more robust than those trained from scratch, and 3) models attend more to scene and objects than to motion and action. We hope this study will serve as a benchmark and guide future research in robust video-language learning. The benchmark introduced in this study, along with the code and datasets, is available at https://bit.ly/3CNOly4.
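The perturb-then-evaluate protocol summarized above can be illustrated with a minimal sketch. Everything here is a simplified assumption rather than the released benchmark code: `char_swap` and `gaussian_noise` stand in for the paper's text and visual perturbations, and the text-video similarity matrices are random toy data rather than real model outputs; only the Recall@K-style comparison between clean and perturbed inputs reflects the general idea.

```python
import numpy as np

rng = np.random.default_rng(0)

def char_swap(caption, p=0.1):
    """Toy text perturbation: swap adjacent characters with probability p."""
    chars = list(caption)
    for i in range(len(chars) - 1):
        if rng.random() < p:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def gaussian_noise(frames, sigma=0.1):
    """Toy visual perturbation: additive Gaussian noise on pixel values in [0, 1]."""
    return np.clip(frames + rng.normal(0.0, sigma, frames.shape), 0.0, 1.0)

def recall_at_k(sim, k=5):
    """Fraction of text queries whose matching video (the diagonal) ranks in the top-k."""
    ranks = (-sim).argsort(axis=1)
    return float(np.mean([i in ranks[i, :k] for i in range(sim.shape[0])]))

# Toy stand-ins for a model's text-video similarity matrix (rows: captions, cols: videos).
clean_sim = rng.normal(size=(100, 100)) + 5.0 * np.eye(100)
perturbed_sim = clean_sim + rng.normal(scale=2.0, size=(100, 100))  # pretend the inputs were perturbed

print(char_swap("a person slices a tomato on a cutting board"))
noisy_clip = gaussian_noise(rng.random((8, 64, 64, 3)))  # 8 fake RGB frames

r_clean, r_pert = recall_at_k(clean_sim), recall_at_k(perturbed_sim)
print(f"R@5 clean={r_clean:.2f}  perturbed={r_pert:.2f}  relative drop={1 - r_pert / r_clean:.1%}")
```

The relative drop in retrieval recall between clean and perturbed inputs is one natural way to summarize robustness; the actual benchmark may use different metrics and perturbation severities.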
Cite
Text
Schiappa et al. "Robustness Analysis of Video-Language Models Against Visual and Language Perturbations." Neural Information Processing Systems, 2022.
Markdown
[Schiappa et al. "Robustness Analysis of Video-Language Models Against Visual and Language Perturbations." Neural Information Processing Systems, 2022.](https://mlanthology.org/neurips/2022/schiappa2022neurips-robustness/)
BibTeX
@inproceedings{schiappa2022neurips-robustness,
  title     = {{Robustness Analysis of Video-Language Models Against Visual and Language Perturbations}},
  author    = {Schiappa, Madeline and Vyas, Shruti and Palangi, Hamid and Rawat, Yogesh and Vineet, Vibhav},
  booktitle = {Neural Information Processing Systems},
  year      = {2022},
  url       = {https://mlanthology.org/neurips/2022/schiappa2022neurips-robustness/}
}