Are VideoQA Models Truly Multimodal?

Abstract

While VideoQA Transformer models demonstrate competitive performance on standard benchmarks, the reasons behind their success are not fully understood. Do these models jointly capture and leverage the rich multimodal structures and dynamics of video and text? Or are they merely exploiting shortcuts to achieve high scores? To answer these questions, we design $\textit{QUAG}$ (QUadrant AveraGe), a lightweight, non-parametric probe for critically analyzing multimodal representations. QUAG enables a combined dataset-model study by systematically ablating a model's coupled multimodal understanding during inference. Surprisingly, the models maintain high performance even under this multimodal impairment, indicating that current VideoQA benchmarks and metrics do not penalize models that exploit shortcuts and forgo joint multimodal understanding. Motivated by this, we propose $\textit{CLAVI}$ (Counterfactual in LAnguage and VIdeo), a diagnostic dataset for coupled multimodal understanding in VideoQA. CLAVI consists of temporal questions and videos that are augmented to curate balanced counterfactuals in both the language and video domains. Evaluating models on CLAVI, we find that all of them achieve high performance on the multimodal-shortcut instances, but most perform very poorly on the counterfactual instances that necessitate joint multimodal understanding. Overall, we show that many VideoQA models fail to learn joint multimodal representations and that their success on standard benchmarks is an illusion of joint multimodal understanding.
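The abstract only names the probe, so for intuition, below is a minimal PyTorch sketch of block-wise quadrant averaging applied to a post-softmax attention matrix over concatenated video and text tokens. The function name `quag_average`, the quadrant encoding, and the token ordering are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def quag_average(attn, n_video, quadrants=("vt", "tv")):
    """Illustrative sketch: replace selected modality quadrants of an
    attention matrix with their row-wise averages.

    attn:      (..., L, L) post-softmax attention over concatenated
               [video tokens; text tokens], where L = n_video + n_text.
    n_video:   number of video tokens (assumed to come first).
    quadrants: blocks to ablate, from "vv", "vt", "tv", "tt".
    """
    attn = attn.clone()
    slices = {"v": slice(0, n_video), "t": slice(n_video, None)}
    for q in quadrants:
        rows, cols = slices[q[0]], slices[q[1]]
        block = attn[..., rows, cols]
        # Row-wise mean: each query now attends uniformly (with its own
        # average weight) to every key in the ablated quadrant.
        attn[..., rows, cols] = block.mean(dim=-1, keepdim=True).expand_as(block)
    return attn
```

In this sketch, averaging the cross-modal quadrants (`"vt"`, `"tv"`) erases token-level cross-modal attention while preserving each query's total attention mass on the other modality; a model whose accuracy survives such an impairment is plausibly not relying on fine-grained joint multimodal interactions.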

Cite

Text

Rawal et al. "Are VideoQA Models Truly Multimodal?" NeurIPS 2023 Workshops: XAIA, 2023.

Markdown

[Rawal et al. "Are VideoQA Models Truly Multimodal?" NeurIPS 2023 Workshops: XAIA, 2023.](https://mlanthology.org/neuripsw/2023/rawal2023neuripsw-videoqa/)

BibTeX

@inproceedings{rawal2023neuripsw-videoqa,
  title     = {{Are VideoQA Models Truly Multimodal?}},
  author    = {Rawal, Ishaan and Jaiswal, Shantanu and Fernando, Basura and Tan, Cheston},
  booktitle = {NeurIPS 2023 Workshops: XAIA},
  year      = {2023},
  url       = {https://mlanthology.org/neuripsw/2023/rawal2023neuripsw-videoqa/}
}