Leveraging Textual Memory and Key Frame Reasoning for Full Video Understanding Using Off-the-Shelf LLMs and VLMs (Student Abstract)
Abstract
To address the limitations of current Large-scale Video-Language Models (LVLMs) in fine-grained understanding and long-term temporal memory, we propose a novel video understanding approach that integrates a Vision Language Model (VLM) and a Large Language Model (LLM) with a textual memory mechanism to ensure continuity and contextual coherence. In addition, we introduce a novel evaluation metric, VAD-Score (Video Automated Description Score), to assess precision, recall, and F1 scores for events, subjects, and objects. Our approach delivers competitive results on a diverse set of videos from the DREAM-1K dataset, spanning categories such as live-action, animation, shorts, stock, and YouTube, with a focus on fine-grained comprehension.
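As a rough illustration of the kind of scoring the abstract describes, the sketch below computes precision, recall, and F1 per category (events, subjects, objects). It is not the authors' VAD-Score implementation: the abstract does not say how items are extracted from a description or matched against the reference, so the exact-string set matching, the function names `prf1` and `score_description`, and the sample data are all assumptions made for illustration.

```python
# Hypothetical sketch: per-category precision/recall/F1 with exact set matching.
# The matching rule and all names here are assumptions, not the paper's method.

CATEGORIES = ("events", "subjects", "objects")


def prf1(predicted, reference):
    """Return (precision, recall, F1) for two collections of labels."""
    pred, ref = set(predicted), set(reference)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def score_description(predicted, reference):
    """Score each category separately; missing categories count as empty."""
    return {
        cat: prf1(predicted.get(cat, []), reference.get(cat, []))
        for cat in CATEGORIES
    }


# Made-up example data, for illustration only.
predicted = {
    "events": ["open door", "sit down"],
    "subjects": ["man"],
    "objects": ["door", "chair", "cup"],
}
reference = {
    "events": ["open door", "sit down", "drink coffee"],
    "subjects": ["man", "dog"],
    "objects": ["door", "chair"],
}

scores = score_description(predicted, reference)
for cat, (p, r, f) in scores.items():
    print(f"{cat}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In this made-up example the predicted description misses one event and one subject (lowering recall) and adds one object that is not in the reference (lowering precision). A practical metric would likely need softer matching than exact string equality, for example synonym or LLM-based matching, which this sketch deliberately leaves out.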
Cite
Text
Dubey and Pack. "Leveraging Textual Memory and Key Frame Reasoning for Full Video Understanding Using Off-the-Shelf LLMs and VLMs (Student Abstract)." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I28.35248
Markdown
[Dubey and Pack. "Leveraging Textual Memory and Key Frame Reasoning for Full Video Understanding Using Off-the-Shelf LLMs and VLMs (Student Abstract)." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/dubey2025aaai-leveraging/) doi:10.1609/AAAI.V39I28.35248
BibTeX
@inproceedings{dubey2025aaai-leveraging,
title = {{Leveraging Textual Memory and Key Frame Reasoning for Full Video Understanding Using Off-the-Shelf LLMs and VLMs (Student Abstract)}},
author = {Dubey, Harsh and Pack, Chulwoo},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2025},
pages = {29351-29352},
doi = {10.1609/AAAI.V39I28.35248},
url = {https://mlanthology.org/aaai/2025/dubey2025aaai-leveraging/}
}