Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition
Abstract
Existing research on video understanding still struggles to achieve in-depth comprehension and reasoning in complex videos, primarily due to the under-exploration of two key bottlenecks: fine-grained spatial-temporal perceptive understanding and cognitive-level video scene comprehension. This paper bridges the gap by presenting a novel solution. We first introduce a novel video Multimodal Large Language Model (MLLM), MotionEpic, which achieves fine-grained pixel-level spatial-temporal video grounding by integrating video spatial-temporal scene graph (STSG) representation. Building upon MotionEpic, we then develop a Video-of-Thought (VoT) reasoning framework. VoT inherits the Chain-of-Thought (CoT) core, breaking down a complex task into simpler and manageable sub-problems and addressing them step by step, from low-level pixel perception to high-level cognitive interpretation. Extensive experiments across various complex video QA benchmarks demonstrate that our overall framework strikingly boosts the existing state of the art. To our knowledge, this is the first attempt to successfully implement the CoT technique for achieving human-level video reasoning, and we show great potential in extending it to a wider range of video understanding scenarios. Systems and code will be released later.
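The abstract's step-by-step decomposition can be pictured as a small prompting loop: a complex video question is split into ordered sub-problems that move from low-level perception to high-level cognition, with each step's output fed forward as context. The sketch below is purely illustrative and is not the authors' implementation; the `query_mllm` stub and the wording of the steps are assumptions standing in for a video MLLM such as MotionEpic.

```python
def query_mllm(prompt: str) -> str:
    """Placeholder for a call to a video MLLM (hypothetical interface)."""
    return f"[model answer to: {prompt.splitlines()[-1]}]"

# Ordered sub-problems, from pixel-level grounding to cognitive interpretation
# (paraphrased stages; the paper's exact prompts may differ).
VOT_STEPS = [
    "Identify the objects in the video relevant to the question.",
    "Ground each relevant object's spatial-temporal trajectory.",
    "Analyze the actions implied by these trajectories.",
    "Rank candidate answers using the action analysis.",
    "Verify the chosen answer against the perceived evidence.",
]

def video_of_thought(question: str) -> list[str]:
    """Run the question through each step, accumulating intermediate results."""
    context, trace = question, []
    for step in VOT_STEPS:
        result = query_mllm(f"{context}\nStep: {step}")
        trace.append(result)
        context += "\n" + result  # later steps see earlier conclusions
    return trace

trace = video_of_thought("Why does the dog suddenly run?")
```

The key design point mirrored here is that reasoning is staged rather than monolithic: perception-level conclusions are produced first and then conditioned on by the cognitive steps, rather than asking the model for the final answer in one shot.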
Cite
Text
Fei et al. "Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition." International Conference on Machine Learning, 2024.
Markdown
[Fei et al. "Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition." International Conference on Machine Learning, 2024.](https://mlanthology.org/icml/2024/fei2024icml-videoofthought/)
BibTeX
@inproceedings{fei2024icml-videoofthought,
title = {{Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition}},
author = {Fei, Hao and Wu, Shengqiong and Ji, Wei and Zhang, Hanwang and Zhang, Meishan and Lee, Mong-Li and Hsu, Wynne},
booktitle = {International Conference on Machine Learning},
year = {2024},
pages = {13109--13125},
volume = {235},
url = {https://mlanthology.org/icml/2024/fei2024icml-videoofthought/}
}