HiCM²: Hierarchical Compact Memory Modeling for Dense Video Captioning

Abstract

With the growing demand for solutions to real-world video challenges, interest in dense video captioning (DVC) has been on the rise. DVC involves automatically localizing and captioning events in untrimmed videos. Several studies highlight the challenges of DVC and introduce improved methods that use prior knowledge, such as pre-training and external memory. In this research, we propose a model that leverages the prior knowledge of a human-oriented hierarchical compact memory, inspired by the human memory hierarchy and cognition. To mimic human-like memory recall, we construct a hierarchical memory and a hierarchical memory reading module. We build an efficient hierarchical compact memory by clustering memory events and summarizing each cluster with large language models. Comparative experiments demonstrate that this hierarchical memory recall process improves DVC performance, achieving state-of-the-art results on the YouCook2 and ViTT datasets.
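The two ideas named in the abstract, building a memory by clustering events and summarizing each cluster, then reading that memory top-down, can be illustrated with a minimal sketch. This is not the authors' implementation. Everything here is an assumption made for illustration: the names `MemoryNode`, `build_memory`, and `read_memory`, the two-level depth, plain k-means with Euclidean distance, and the `summarize` placeholder that joins captions where the paper uses a large language model.

```python
import math
import random
from dataclasses import dataclass, field


@dataclass
class MemoryNode:
    """One memory entry: an embedding plus a caption or summary (hypothetical)."""
    centroid: list[float]
    summary: str
    children: list["MemoryNode"] = field(default_factory=list)


def _mean(vectors):
    """Element-wise mean of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def kmeans(vectors, k, iters=20, seed=0):
    """Plain k-means; returns one cluster index per input vector."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    assign = [0] * len(vectors)
    for _ in range(iters):
        # Assign each vector to its nearest centroid.
        assign = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Move each centroid to the mean of its members (skip empty clusters).
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:
                centroids[c] = _mean(members)
    return assign


def summarize(captions):
    """Placeholder for the LLM summarization step (assumption: just joins text)."""
    return " / ".join(captions)


def build_memory(vectors, captions, k):
    """Cluster event-level entries and attach one summary node per cluster."""
    leaves = [MemoryNode(v, c) for v, c in zip(vectors, captions)]
    assign = kmeans(vectors, k)
    parents = []
    for c in range(k):
        members = [leaf for leaf, a in zip(leaves, assign) if a == c]
        if members:
            parents.append(
                MemoryNode(
                    _mean([m.centroid for m in members]),
                    summarize([m.summary for m in members]),
                    members,
                )
            )
    return parents


def read_memory(query, parents, top_k=1):
    """Top-down read: pick the nearest summary node, then its nearest events."""
    best = min(parents, key=lambda p: math.dist(query, p.centroid))
    ranked = sorted(best.children, key=lambda n: math.dist(query, n.centroid))
    return best.summary, [n.summary for n in ranked[:top_k]]
```

As a usage example, calling `build_memory` on four toy 2-D embeddings with captions such as "crack eggs" and "heat pan" and `k=2` yields two summary nodes, and `read_memory` with a query near one group returns that group's summary followed by its closest event caption. Searching the summary level first keeps the lookup cheap, because only the children of the selected cluster are compared against the query.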

Cite

Text

Kim et al. "HiCM²: Hierarchical Compact Memory Modeling for Dense Video Captioning." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I4.32451

Markdown

[Kim et al. "HiCM²: Hierarchical Compact Memory Modeling for Dense Video Captioning." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/kim2025aaai-hicm/) doi:10.1609/AAAI.V39I4.32451

BibTeX

@inproceedings{kim2025aaai-hicm,
  title     = {{HiCM²: Hierarchical Compact Memory Modeling for Dense Video Captioning}},
  author    = {Kim, Minkuk and Kim, Hyeon Bae and Moon, Jinyoung and Choi, Jinwoo and Kim, Seong Tae},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {4293--4301},
  doi       = {10.1609/AAAI.V39I4.32451},
  url       = {https://mlanthology.org/aaai/2025/kim2025aaai-hicm/}
}