Target Adaptive Context Aggregation for Video Scene Graph Generation
Abstract
This paper deals with the challenging task of video scene graph generation (VidSGG), which could serve as a structured video representation for high-level understanding tasks. We present a new detect-to-track paradigm for this task by decoupling the context modeling for relation prediction from the complicated low-level entity tracking. Specifically, we design an efficient method for frame-level VidSGG, termed Target Adaptive Context Aggregation Network (TRACE), with a focus on capturing spatio-temporal context information for relation recognition. Our TRACE framework streamlines the VidSGG pipeline with a modular design, and presents two unique blocks: Hierarchical Relation Tree (HRTree) construction and Target-adaptive Context Aggregation. More specifically, the HRTree first provides an adaptive structure for organizing possible relation candidates efficiently, and guides the context aggregation module to effectively capture spatio-temporal structure information. Then, we obtain a contextualized feature representation for each relation candidate and build a classification head to recognize its relation category. Finally, we provide a simple temporal association strategy to track the results detected by TRACE, yielding video-level VidSGG. We perform experiments on two VidSGG benchmarks, ImageNet-VidVRD and Action Genome, and the results demonstrate that TRACE achieves state-of-the-art performance. The code and models are available at https://github.com/MCG-NJU/TRACE.
Cite
Text
Teng et al. "Target Adaptive Context Aggregation for Video Scene Graph Generation." International Conference on Computer Vision, 2021. doi:10.1109/ICCV48922.2021.01343
Markdown
[Teng et al. "Target Adaptive Context Aggregation for Video Scene Graph Generation." International Conference on Computer Vision, 2021.](https://mlanthology.org/iccv/2021/teng2021iccv-target/) doi:10.1109/ICCV48922.2021.01343
BibTeX
@inproceedings{teng2021iccv-target,
  title     = {{Target Adaptive Context Aggregation for Video Scene Graph Generation}},
  author    = {Teng, Yao and Wang, Limin and Li, Zhifeng and Wu, Gangshan},
  booktitle = {International Conference on Computer Vision},
  year      = {2021},
  pages     = {13688--13697},
  doi       = {10.1109/ICCV48922.2021.01343},
  url       = {https://mlanthology.org/iccv/2021/teng2021iccv-target/}
}