Commonsense Video Question Answering Through Video-Grounded Entailment Tree Reasoning

Abstract

This paper proposes the first video-grounded entailment tree reasoning method for commonsense video question answering (VQA). Despite the remarkable progress of large visual-language models (VLMs), there are growing concerns that they learn spurious correlations between videos and likely answers, reinforced by their black-box nature and remaining benchmarking biases. Our method explicitly grounds VQA tasks to video fragments in four steps: entailment tree construction, video-language entailment verification, tree reasoning, and dynamic tree expansion. A vital benefit of the method is its generalizability to current video- and image-based VLMs across reasoning types. To support fair evaluation, we devise a de-biasing procedure based on large language models that rewrites VQA benchmark answer sets to enforce model reasoning. Systematic experiments on existing and de-biased benchmarks highlight the impact of our method components across benchmarks, VLMs, and reasoning types.
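
To make the four steps named in the abstract concrete, below is a minimal, hypothetical Python sketch of such a pipeline. It is not the authors' implementation: every name (EntailmentNode, build_tree, verify_entailment, reason, expand_if_uncertain) is an illustrative assumption, and the VLM call is stubbed with a constant score.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class EntailmentNode:
    statement: str                        # natural-language hypothesis about the video
    children: List["EntailmentNode"] = field(default_factory=list)
    score: Optional[float] = None         # entailment confidence for this statement

def build_tree(question: str, answer: str) -> EntailmentNode:
    """Step 1 (entailment tree construction): decompose a candidate
    answer into sub-statements. Stubbed with two fixed premises."""
    root = EntailmentNode(f"{question} -> {answer}")
    root.children = [EntailmentNode(f"premise A of '{answer}'"),
                     EntailmentNode(f"premise B of '{answer}'")]
    return root

def verify_entailment(node: EntailmentNode, video: object) -> float:
    """Step 2 (video-language entailment verification): ask a VLM whether
    the video entails the statement. Stubbed with a constant confidence."""
    return 0.8

def reason(node: EntailmentNode, video: object) -> float:
    """Step 3 (tree reasoning): score leaves with the verifier and
    aggregate child scores bottom-up (here: a simple mean)."""
    if not node.children:
        node.score = verify_entailment(node, video)
    else:
        node.score = sum(reason(c, video) for c in node.children) / len(node.children)
    return node.score

def expand_if_uncertain(node: EntailmentNode, threshold: float = 0.5) -> None:
    """Step 4 (dynamic tree expansion): grow low-confidence leaves
    with finer-grained premises for another verification round."""
    for leaf in (n for n in node.children if not n.children):
        if (leaf.score or 0.0) < threshold:
            leaf.children.append(EntailmentNode(f"refined: {leaf.statement}"))

if __name__ == "__main__":
    tree = build_tree("Why does the man open the umbrella?", "Because it starts to rain")
    confidence = reason(tree, video=None)   # a real system would pass video frames
    expand_if_uncertain(tree)
    print(f"answer confidence: {confidence:.2f}")

In use, one such tree would be built and scored per candidate answer, with the highest-confidence answer selected; the aggregation rule and expansion criterion here are placeholder choices.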

Cite

Text

Liu et al. "Commonsense Video Question Answering Through Video-Grounded Entailment Tree Reasoning." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.00310

Markdown

[Liu et al. "Commonsense Video Question Answering Through Video-Grounded Entailment Tree Reasoning." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/liu2025cvpr-commonsense/) doi:10.1109/CVPR52734.2025.00310

BibTeX

@inproceedings{liu2025cvpr-commonsense,
  title     = {{Commonsense Video Question Answering Through Video-Grounded Entailment Tree Reasoning}},
  author    = {Liu, Huabin and Ilievski, Filip and Snoek, Cees G. M.},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {3262--3271},
  doi       = {10.1109/CVPR52734.2025.00310},
  url       = {https://mlanthology.org/cvpr/2025/liu2025cvpr-commonsense/}
}