HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation

Abstract

Multimodal LLMs have advanced vision-language tasks but still struggle to understand video scenes. To bridge this gap, Video Scene Graph Generation (VidSGG) has emerged to capture multi-object relationships across video frames. However, prior methods rely on pairwise connections, which limits their ability to handle complex multi-object interactions and reasoning. To this end, we propose Multimodal LLMs on a Scene HyperGraph (HyperGLM), which promotes reasoning about multi-way interactions and higher-order relationships. Our approach uniquely integrates entity scene graphs, which capture spatial relationships between objects, with a procedural graph that models their causal transitions, forming a unified HyperGraph. HyperGLM then enables reasoning by injecting this unified HyperGraph into LLMs. Additionally, we introduce a new Video Scene Graph Reasoning (VSGR) dataset that features 1.9M frames from third-person, egocentric, and drone views and supports five tasks: Scene Graph Generation, Scene Graph Anticipation, Video Question Answering, Video Captioning, and Relation Reasoning. Empirically, HyperGLM consistently outperforms state-of-the-art methods across all five tasks, effectively modeling and reasoning about complex relationships in diverse video scenes.
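The abstract contrasts pairwise scene-graph edges with hyperedges that join several entities at once, and pairs a per-frame entity view with a procedural view of how relations change over time. The Python sketch below illustrates only that distinction, under our own assumptions: the `SceneHyperGraph` and `HyperEdge` classes, the frame indices, and the relation labels are hypothetical and are not HyperGLM's actual construction or the VSGR annotation format. In particular, the procedural view here simply chains an entity's relations in frame order; it does not attempt to model causality as the paper does.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class HyperEdge:
    """One relation over any number of entities in one frame (hypothetical structure).

    A pairwise scene-graph edge is the special case len(nodes) == 2.
    """
    frame: int
    label: str
    nodes: frozenset


@dataclass
class SceneHyperGraph:
    """Illustrative container only; not the paper's HyperGraph implementation."""
    edges: list = field(default_factory=list)

    def add_relation(self, frame, label, *entities):
        # A single hyperedge can join two or more entities.
        self.edges.append(HyperEdge(frame, label, frozenset(entities)))

    def relations_at(self, frame):
        """Entity-graph view: every hyperedge observed in a single frame."""
        return [e for e in self.edges if e.frame == frame]

    def transitions(self, entity):
        """Simplified procedural view: consecutive (earlier, later) relation
        labels involving `entity`, ordered by frame (no causal modeling)."""
        history = [e.label for e in sorted(self.edges, key=lambda e: e.frame)
                   if entity in e.nodes]
        return list(zip(history, history[1:]))


# Toy video with hypothetical entity and relation labels.
g = SceneHyperGraph()
g.add_relation(0, "holding", "person", "cup")
g.add_relation(1, "pouring_into", "person", "kettle", "cup")  # one 3-way hyperedge
g.add_relation(2, "drinking_from", "person", "cup")

print(g.transitions("cup"))
# [('holding', 'pouring_into'), ('pouring_into', 'drinking_from')]
```

A purely pairwise graph would have to split `pouring_into` into separate person/kettle, kettle/cup, and person/cup edges, losing the fact that all three entities take part in one interaction; the hyperedge keeps them together.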

Cite

Text

Nguyen et al. "HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.02714

Markdown

[Nguyen et al. "HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/nguyen2025cvpr-hyperglm/) doi:10.1109/CVPR52734.2025.02714

BibTeX

@inproceedings{nguyen2025cvpr-hyperglm,
  title     = {{HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation}},
  author    = {Nguyen, Trong-Thuan and Nguyen, Pha and Cothren, Jackson and Yilmaz, Alper and Luu, Khoa},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {29150--29160},
  doi       = {10.1109/CVPR52734.2025.02714},
  url       = {https://mlanthology.org/cvpr/2025/nguyen2025cvpr-hyperglm/}
}