Leveraging Textual Memory and Key Frame Reasoning for Full Video Understanding Using Off-the-Shelf LLMs and VLMs (Student Abstract)

Abstract

To address the limitations of current Large-scale Video-Language Models (LVLMs) in fine-grained understanding and long-term temporal memory, we propose a novel video understanding approach that integrates a Vision Language Model (VLM) and a Large Language Model (LLM) with a textual memory mechanism to ensure continuity and contextual coherence. In addition, we introduce a novel evaluation metric, VAD-Score (Video Automated Description Score), to assess precision, recall, and F1 scores for events, subjects, and objects. Our approach delivers competitive results on a diverse set of videos from the DREAM-1K dataset, spanning categories such as live-action, animation, shorts, stock, and YouTube, with a focus on fine-grained comprehension.
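The abstract states that VAD-Score reports precision, recall, and F1 over events, subjects, and objects. The exact matching procedure is not specified here; the following is a minimal hypothetical sketch, assuming exact set-level matching between predicted and reference items (the paper may use softer, semantic matching):

```python
# Hypothetical sketch of set-based precision/recall/F1 over extracted
# items (events, subjects, or objects). The actual VAD-Score matching
# procedure is not detailed in this abstract.

def prf1(predicted: set, reference: set) -> tuple:
    """Return (precision, recall, F1) comparing predicted vs. reference items."""
    if not predicted and not reference:
        return 1.0, 1.0, 1.0
    tp = len(predicted & reference)  # true positives: items found in both sets
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Illustrative events (invented for this example):
pred = {"man opens door", "dog barks"}
ref = {"man opens door", "dog barks", "woman waves"}
p, r, f = prf1(pred, ref)  # p=1.0, r=2/3, f=0.8
```

In practice a semantic similarity threshold (rather than exact string equality) would likely be needed, since free-form video descriptions rarely match reference phrasing verbatim.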

Cite

Text

Dubey and Pack. "Leveraging Textual Memory and Key Frame Reasoning for Full Video Understanding Using Off-the-Shelf LLMs and VLMs (Student Abstract)." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/aaai.v39i28.35248

Markdown

[Dubey and Pack. "Leveraging Textual Memory and Key Frame Reasoning for Full Video Understanding Using Off-the-Shelf LLMs and VLMs (Student Abstract)." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/dubey2025aaai-leveraging/) doi:10.1609/aaai.v39i28.35248

BibTeX

@inproceedings{dubey2025aaai-leveraging,
  title     = {{Leveraging Textual Memory and Key Frame Reasoning for Full Video Understanding Using Off-the-Shelf LLMs and VLMs (Student Abstract)}},
  author    = {Dubey, Harsh and Pack, Chulwoo},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {29351--29352},
  doi       = {10.1609/aaai.v39i28.35248},
  url       = {https://mlanthology.org/aaai/2025/dubey2025aaai-leveraging/}
}