Commonsense Video Question Answering Through Video-Grounded Entailment Tree Reasoning
Abstract
This paper proposes the first video-grounded entailment tree reasoning method for commonsense video question answering (VQA). Despite the remarkable progress of large visual-language models (VLMs), there are growing concerns that they learn spurious correlations between videos and likely answers, reinforced by their black-box nature and remaining benchmarking biases. Our method explicitly grounds VQA tasks to video fragments in four steps: entailment tree construction, video-language entailment verification, tree reasoning, and dynamic tree expansion. A vital benefit of the method is its generalizability to current video- and image-based VLMs across reasoning types. To support fair evaluation, we devise a de-biasing procedure based on large language models that rewrites VQA benchmark answer sets to enforce model reasoning. Systematic experiments on existing and de-biased benchmarks highlight the impact of our method components across benchmarks, VLMs, and reasoning types.
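To make the four-step pipeline named in the abstract concrete, the sketch below shows one plausible shape it could take. This is an illustrative assumption, not the authors' implementation: the class and function names (EntailmentNode, verify, reason, expand), the mock VLM scorer, and the min/product aggregation rule are all hypothetical stand-ins.

```python
# Hypothetical sketch of the four steps from the abstract: tree construction,
# video-language entailment verification, tree reasoning, and dynamic expansion.
# All names and scoring rules here are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class EntailmentNode:
    """A statement in the entailment tree, to be grounded in video fragments."""
    statement: str
    children: List["EntailmentNode"] = field(default_factory=list)
    score: float = 0.0  # video-language entailment confidence in [0, 1]


def verify(node: EntailmentNode, entails: Callable[[str], float]) -> None:
    """Step 2: score every statement against the video with a VLM verifier."""
    node.score = entails(node.statement)
    for child in node.children:
        verify(child, entails)


def reason(node: EntailmentNode) -> float:
    """Step 3: aggregate scores bottom-up (here: a simple product/min rule)."""
    if not node.children:
        return node.score
    return node.score * min(reason(child) for child in node.children)


def expand(node: EntailmentNode, decompose: Callable[[str], List[str]],
           threshold: float = 0.5) -> None:
    """Step 4: dynamically decompose low-confidence leaves into sub-statements."""
    if not node.children and node.score < threshold:
        node.children = [EntailmentNode(s) for s in decompose(node.statement)]
    else:
        for child in node.children:
            expand(child, decompose, threshold)


if __name__ == "__main__":
    # Step 1: construct a toy tree for one candidate answer (hypothetical content).
    root = EntailmentNode(
        "The person opens the fridge to get a drink",
        [EntailmentNode("A fridge is visible"),
         EntailmentNode("The person reaches inside")],
    )
    mock_vlm = lambda s: 0.9 if "fridge" in s else 0.4  # stand-in for a real VLM
    verify(root, mock_vlm)
    expand(root, lambda s: [s + ", shown in a close-up"])
    verify(root, mock_vlm)  # re-score any newly added sub-statements
    print(f"answer confidence: {reason(root):.2f}")
```

In this toy run, the low-confidence leaf is decomposed and re-verified before the bottom-up aggregation produces a final confidence for the candidate answer; the real method presumably uses a VLM for verification and an LLM for decomposition rather than these string heuristics.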
Cite
Text
Liu et al. "Commonsense Video Question Answering Through Video-Grounded Entailment Tree Reasoning." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.00310
Markdown
[Liu et al. "Commonsense Video Question Answering Through Video-Grounded Entailment Tree Reasoning." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/liu2025cvpr-commonsense/) doi:10.1109/CVPR52734.2025.00310
BibTeX
@inproceedings{liu2025cvpr-commonsense,
title = {{Commonsense Video Question Answering Through Video-Grounded Entailment Tree Reasoning}},
author = {Liu, Huabin and Ilievski, Filip and Snoek, Cees G. M.},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2025},
  pages = {3262--3271},
doi = {10.1109/CVPR52734.2025.00310},
url = {https://mlanthology.org/cvpr/2025/liu2025cvpr-commonsense/}
}