Open-Vocabulary Video Relation Extraction

Abstract

A comprehensive understanding of videos is inseparable from describing the action with its contextual action-object interactions. However, many current video understanding tasks prioritize general action classification and overlook the actors and relationships that shape the nature of the action, resulting in a superficial understanding of the action. Motivated by this, we introduce Open-vocabulary Video Relation Extraction (OVRE), a novel task that views action understanding through the lens of action-centric relation triplets. OVRE focuses on the pairwise relations that take part in the action and describes these relation triplets in natural language. Moreover, we curate the Moments-OVRE dataset, which comprises 180K videos with action-centric relation triplets, sourced from a multi-label action classification dataset. With Moments-OVRE, we further propose a cross-modal mapping model to generate relation triplets as a sequence. Finally, we benchmark existing cross-modal generation models on the new task of OVRE. Our code and dataset are available at https://github.com/Iriya99/OVRE.

Cite

Text

Tian et al. "Open-Vocabulary Video Relation Extraction." AAAI Conference on Artificial Intelligence, 2024. doi:10.1609/AAAI.V38I6.28328

Markdown

[Tian et al. "Open-Vocabulary Video Relation Extraction." AAAI Conference on Artificial Intelligence, 2024.](https://mlanthology.org/aaai/2024/tian2024aaai-open/) doi:10.1609/AAAI.V38I6.28328

BibTeX

@inproceedings{tian2024aaai-open,
  title     = {{Open-Vocabulary Video Relation Extraction}},
  author    = {Tian, Wentao and Wang, Zheng and Fu, Yuqian and Chen, Jingjing and Cheng, Lechao},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2024},
  pages     = {5215--5223},
  doi       = {10.1609/AAAI.V38I6.28328},
  url       = {https://mlanthology.org/aaai/2024/tian2024aaai-open/}
}