From Representation to Reasoning: Towards Both Evidence and Commonsense Reasoning for Video Question-Answering
Abstract
Video understanding has achieved great success in representation learning, such as video captioning, video object grounding, and video descriptive question-answering. However, current methods still struggle with video reasoning, including evidence reasoning and commonsense reasoning. To facilitate deeper video understanding towards video reasoning, we present the task of Causal-VidQA, which includes four types of questions ranging from scene description (description) to evidence reasoning (explanation) and commonsense reasoning (prediction and counterfactual). For commonsense reasoning, we set up a two-step solution: answering the question and providing a proper reason. Through extensive experiments on existing VideoQA methods, we find that the state-of-the-art methods are strong at description but weak at reasoning. We hope that Causal-VidQA can guide the research of video understanding from representation learning to deeper reasoning. The dataset and related resources are available at https://github.com/bcmi/Causal-VidQA.git.
Cite
Text
Li et al. "From Representation to Reasoning: Towards Both Evidence and Commonsense Reasoning for Video Question-Answering." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.02059
Markdown
[Li et al. "From Representation to Reasoning: Towards Both Evidence and Commonsense Reasoning for Video Question-Answering." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/li2022cvpr-representation/) doi:10.1109/CVPR52688.2022.02059
BibTeX
@inproceedings{li2022cvpr-representation,
title = {{From Representation to Reasoning: Towards Both Evidence and Commonsense Reasoning for Video Question-Answering}},
author = {Li, Jiangtong and Niu, Li and Zhang, Liqing},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2022},
pages = {21273--21282},
doi = {10.1109/CVPR52688.2022.02059},
url = {https://mlanthology.org/cvpr/2022/li2022cvpr-representation/}
}