SlowFocus: Enhancing Fine-Grained Temporal Understanding in Video LLM

Abstract

Large language models (LLMs) have demonstrated exceptional capabilities in text understanding, which has paved the way for their expansion into video LLMs (Vid-LLMs) to analyze video data. However, current Vid-LLMs struggle to simultaneously retain high-quality frame-level semantic information (i.e., a sufficient number of tokens per frame) and comprehensive video-level temporal information (i.e., an adequate number of sampled frames per video). This limitation hinders the advancement of Vid-LLMs towards fine-grained video understanding. To address this issue, we introduce the SlowFocus mechanism, which significantly enhances the equivalent sampling frequency without compromising the quality of frame-level visual tokens. SlowFocus begins by identifying the query-related temporal segment based on the posed question, then performs dense sampling on this segment to extract local high-frequency features. A multi-frequency mixing attention module is further leveraged to aggregate these local high-frequency details with global low-frequency contexts for enhanced temporal comprehension. Additionally, to tailor Vid-LLMs to this innovative mechanism, we introduce a set of training strategies aimed at bolstering both temporal grounding and detailed temporal reasoning capabilities. Furthermore, we establish FineAction-CGR, a benchmark specifically devised to assess the ability of Vid-LLMs to process fine-grained temporal understanding tasks. Comprehensive experiments demonstrate the superiority of our mechanism across both existing public video understanding benchmarks and our proposed FineAction-CGR.
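The two-rate sampling idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`uniform_indices`, `two_rate_indices`), the midpoint-based uniform sampling rule, and the assumption that the query-related segment is already given as a frame range are all hypothetical choices made for illustration. The segment localization step and the multi-frequency mixing attention module are not modeled here.

```python
from typing import List, Tuple


def uniform_indices(start: int, end: int, count: int) -> List[int]:
    """Pick `count` frame indices spread evenly over the half-open range [start, end).

    Each index is the midpoint of one of `count` equal-width bins
    (an assumed sampling rule, not taken from the paper).
    """
    if count <= 0 or end <= start:
        return []
    step = (end - start) / count
    return [min(end - 1, int(start + step * (i + 0.5))) for i in range(count)]


def two_rate_indices(
    num_frames: int,
    segment: Tuple[int, int],
    global_count: int,
    local_count: int,
) -> Tuple[List[int], List[int]]:
    """Return (global low-frequency indices, local high-frequency indices).

    `segment` is a hypothetical (start, end) frame range standing in for the
    query-related temporal segment; here it is simply passed in by the caller.
    """
    seg_start = max(0, segment[0])
    seg_end = min(num_frames, segment[1])
    # Sparse samples over the whole video provide global context.
    global_idx = uniform_indices(0, num_frames, global_count)
    # Dense samples inside the segment provide fine-grained local detail.
    local_idx = uniform_indices(seg_start, seg_end, local_count)
    return global_idx, local_idx


if __name__ == "__main__":
    global_idx, local_idx = two_rate_indices(
        num_frames=100, segment=(40, 50), global_count=4, local_count=5
    )
    print("global:", global_idx)
    print("local: ", local_idx)
```

In this toy setting, the global pass samples one frame per 25 frames of video, while the local pass samples one frame per 2 frames inside the segment. That is the sense in which the equivalent sampling frequency rises within the segment without changing the number of tokens spent on each frame.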

Cite

Text

Nie et al. "SlowFocus: Enhancing Fine-Grained Temporal Understanding in Video LLM." Neural Information Processing Systems, 2024. doi:10.52202/079017-2599

Markdown

[Nie et al. "SlowFocus: Enhancing Fine-Grained Temporal Understanding in Video LLM." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/nie2024neurips-slowfocus/) doi:10.52202/079017-2599

BibTeX

@inproceedings{nie2024neurips-slowfocus,
  title     = {{SlowFocus: Enhancing Fine-Grained Temporal Understanding in Video LLM}},
  author    = {Nie, Ming and Ding, Dan and Wang, Chunwei and Guo, Yuanfan and Han, Jianhua and Xu, Hang and Zhang, Li},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-2599},
  url       = {https://mlanthology.org/neurips/2024/nie2024neurips-slowfocus/}
}