Can Retelling Have Adequate Information for Reasoning? an Enhancement Method for Imperfect Video Understanding with Large Language Model

Abstract

Large Language Models (LLMs) demonstrate strong capabilities in video understanding. However, they exhibit hallucinations and factual errors in video description. Existing Multimodal Large Language Models (MLLMs) are primarily trained by combining language models and vision models, so their visual understanding capabilities depend on the performance of the backbone. Moreover, video descriptions are often incomplete and may contain errors. Given the demonstrated reasoning capabilities of LLMs, this paper proposes ERSR, a novel Entity and Relationship based Self-Enhanced Reasoning method for imperfect video understanding. Specifically, an entity-and-relationship strategy is designed to construct scene graphs from the limited observed entity relationships, thereby enhancing video descriptions. Furthermore, a self-enhanced forward and feedback reasoning strategy uses question feedback to strengthen the reasoning logic. Finally, the predicted question-answering results are re-validated through rethinking and verification by the LLM. Extensive experiments show that the proposed method achieves competitive results on real-world video understanding datasets, with an overall improvement of no less than 1.4%.
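The three stages named in the abstract (scene-graph enhancement of the description, forward and feedback reasoning, and final verification) can be pictured with a minimal Python sketch. This is not the authors' implementation: the function names `build_scene_graph`, `enhance_description`, and `answer_with_verification`, the triple format, the prompt wording, and the `llm` callable are all hypothetical choices made for illustration, and the abstract does not specify how ERSR extracts entities or phrases its prompts.

```python
from collections import defaultdict
from typing import Callable, Iterable

# Hypothetical representation: one observed fact as (subject, relation, object).
Triple = tuple[str, str, str]


def build_scene_graph(triples: Iterable[Triple]) -> dict[str, list[tuple[str, str]]]:
    """Group observed triples by subject, dropping exact duplicates."""
    graph: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for subj, rel, obj in triples:
        if (rel, obj) not in graph[subj]:
            graph[subj].append((rel, obj))
    return dict(graph)


def enhance_description(description: str, graph: dict[str, list[tuple[str, str]]]) -> str:
    """Append the graph's relations to the description as plain sentences."""
    facts = [f"{s} {r} {o}." for s, edges in graph.items() for r, o in edges]
    return description.rstrip() + " Known relations: " + " ".join(facts)


def answer_with_verification(
    question: str,
    options: list[str],
    context: str,
    llm: Callable[[str], str],  # assumed interface: prompt text in, reply text out
    max_rounds: int = 2,
) -> str:
    """Forward pass, then a verification pass; feed failures back and retry."""
    feedback = ""
    answer = options[0]
    for _ in range(max_rounds):
        # Forward reasoning: answer from the enhanced description.
        answer = llm(
            f"Context: {context}\nQuestion: {question}\n"
            f"Options: {options}\n{feedback}Answer:"
        )
        # Rethink and verify: ask whether the context supports the answer.
        verdict = llm(
            f"Context: {context}\nQuestion: {question}\n"
            f"Proposed answer: {answer}\nIs it supported? yes/no:"
        )
        if verdict.strip().lower().startswith("yes"):
            break
        # Feedback reasoning: tell the next forward pass what went wrong.
        feedback = f"Previous answer '{answer}' was not supported by the context.\n"
    return answer
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM client; in practice `llm` would wrap whichever model is being evaluated.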

Cite

Text

Li et al. "Can Retelling Have Adequate Information for Reasoning? an Enhancement Method for Imperfect Video Understanding with Large Language Model." International Joint Conference on Artificial Intelligence, 2025. doi:10.24963/IJCAI.2025/906

Markdown

[Li et al. "Can Retelling Have Adequate Information for Reasoning? an Enhancement Method for Imperfect Video Understanding with Large Language Model." International Joint Conference on Artificial Intelligence, 2025.](https://mlanthology.org/ijcai/2025/li2025ijcai-retelling/) doi:10.24963/IJCAI.2025/906

BibTeX

@inproceedings{li2025ijcai-retelling,
  title     = {{Can Retelling Have Adequate Information for Reasoning? an Enhancement Method for Imperfect Video Understanding with Large Language Model}},
  author    = {Li, Mingxin and Wang, Wenhao and Ji, Hongru and Li, Xianghua and Gao, Chao},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {8150-8158},
  doi       = {10.24963/IJCAI.2025/906},
  url       = {https://mlanthology.org/ijcai/2025/li2025ijcai-retelling/}
}