PACS: A Dataset for Physical Audiovisual Commonsense Reasoning

Abstract

For AI to be safely deployed in real-world settings such as hospitals, schools, and the workplace, it must be able to reason robustly about the physical world. Fundamental to this reasoning is physical common sense: understanding the physical properties and affordances of available objects, how they can be manipulated, and how they interact with other objects. Physical commonsense reasoning is fundamentally a multi-sensory task, since physical properties are manifested through multiple modalities, two of which are vision and acoustics. Our paper takes a step towards real-world physical commonsense reasoning by contributing PACS: the first audiovisual benchmark annotated for physical commonsense attributes. PACS contains 13,400 question-answer pairs, involving 1,377 unique physical commonsense questions and 1,526 videos. Our dataset provides new opportunities to advance research on physical reasoning by making audio a core component of this multimodal problem. Using PACS, we evaluate multiple state-of-the-art models on this challenging new task. While some models show promising results (70% accuracy), they all fall short of human performance (95% accuracy). We conclude the paper by demonstrating the importance of multimodal reasoning and outlining possible avenues for future research.

Cite

Text

Yu et al. "PACS: A Dataset for Physical Audiovisual Commonsense Reasoning." Proceedings of the European Conference on Computer Vision (ECCV), 2022. doi:10.1007/978-3-031-19836-6

Markdown

[Yu et al. "PACS: A Dataset for Physical Audiovisual Commonsense Reasoning." Proceedings of the European Conference on Computer Vision (ECCV), 2022.](https://mlanthology.org/eccv/2022/yu2022eccv-pacs/) doi:10.1007/978-3-031-19836-6

BibTeX

@inproceedings{yu2022eccv-pacs,
  title     = {{PACS: A Dataset for Physical Audiovisual Commonsense Reasoning}},
  author    = {Yu, Samuel and Wu, Peter and Liang, Paul Pu and Salakhutdinov, Ruslan and Morency, Louis-Philippe},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2022},
  doi       = {10.1007/978-3-031-19836-6},
  url       = {https://mlanthology.org/eccv/2022/yu2022eccv-pacs/}
}