From Recognition to Cognition: Visual Commonsense Reasoning

Abstract

Visual understanding goes well beyond object recognition. With one glance at an image, we can effortlessly imagine the world beyond the pixels: for instance, we can infer people's actions, goals, and mental states. While this task is easy for humans, it is tremendously difficult for today's vision systems, requiring higher-order cognition and commonsense reasoning about the world. We formalize this task as Visual Commonsense Reasoning. Given a challenging question about an image, a machine must answer correctly and then provide a rationale justifying its answer. Next, we introduce a new dataset, VCR, consisting of 290k multiple choice QA problems derived from 110k movie scenes. The key recipe for generating non-trivial and high-quality problems at scale is Adversarial Matching, a new approach to transform rich annotations into multiple choice questions with minimal bias. Experimental results show that while humans find VCR easy (over 90% accuracy), state-of-the-art vision models struggle ( 45%). To move towards cognition-level understanding, we present a new reasoning engine, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning. R2C helps narrow the gap between humans and machines ( 65%); still, the challenge is far from solved, and we provide analysis that suggests avenues for future work.

Cite

Text

Zellers et al. "From Recognition to Cognition: Visual Commonsense Reasoning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019. doi:10.1109/CVPR.2019.00688

Markdown

[Zellers et al. "From Recognition to Cognition: Visual Commonsense Reasoning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.](https://mlanthology.org/cvpr/2019/zellers2019cvpr-recognition/) doi:10.1109/CVPR.2019.00688

BibTeX

@inproceedings{zellers2019cvpr-recognition,
  title     = {{From Recognition to Cognition: Visual Commonsense Reasoning}},
  author    = {Zellers, Rowan and Bisk, Yonatan and Farhadi, Ali and Choi, Yejin},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2019},
  doi       = {10.1109/CVPR.2019.00688},
  url       = {https://mlanthology.org/cvpr/2019/zellers2019cvpr-recognition/}
}