Video-and-Language (VidL) Models and Their Cognitive Relevance

Abstract

In this paper, we present a narrative review of multi-modal video-language (VidL) models. We introduce the current landscape of VidL models and benchmarks, and draw inspiration from neuroscience and cognitive science to propose avenues for future research on VidL models in particular and artificial intelligence (AI) in general. We argue that iterative feedback loops between AI, neuroscience, and cognitive science are essential to spur progress across these disciplines. We motivate our focus on VidL models and their benchmarks as a promising class of models for driving improvements in AI, and categorise current VidL efforts along multiple ‘cognitive relevance axioms'. Finally, we offer suggestions on how to effectively incorporate this interdisciplinary viewpoint into research on VidL models in particular and AI in general. In doing so, we hope to raise awareness of the potential of VidL models to narrow the gap between neuroscience, cognitive science, and AI.

Cite

Text

Zonneveld et al. "Video-and-Language (VidL) Models and Their Cognitive Relevance." IEEE/CVF International Conference on Computer Vision Workshops, 2023. doi:10.1109/ICCVW60793.2023.00040

Markdown

[Zonneveld et al. "Video-and-Language (VidL) Models and Their Cognitive Relevance." IEEE/CVF International Conference on Computer Vision Workshops, 2023.](https://mlanthology.org/iccvw/2023/zonneveld2023iccvw-videoandlanguage/) doi:10.1109/ICCVW60793.2023.00040

BibTeX

@inproceedings{zonneveld2023iccvw-videoandlanguage,
  title     = {{Video-and-Language (VidL) Models and Their Cognitive Relevance}},
  author    = {Zonneveld, Anne and Gatt, Albert and Calixto, Iacer},
  booktitle = {IEEE/CVF International Conference on Computer Vision Workshops},
  year      = {2023},
  pages     = {325--338},
  doi       = {10.1109/ICCVW60793.2023.00040},
  url       = {https://mlanthology.org/iccvw/2023/zonneveld2023iccvw-videoandlanguage/}
}