VRDFormer: End-to-End Video Visual Relation Detection with Transformers

Abstract

Visual relation understanding plays an essential role in holistic video understanding. Most previous works adopt a multi-stage framework for video visual relation detection (VidVRD), which cannot capture long-term spatio-temporal contexts across stages and also suffers from inefficiency. In this paper, we propose a transformer-based framework called VRDFormer to unify these decoupled stages. Our model exploits a query-based approach to autoregressively generate relation instances. We specifically design static queries and recurrent queries to enable efficient object pair tracking with spatio-temporal contexts. The model is jointly trained with object pair detection and relation classification. Extensive experiments on two benchmark datasets, ImageNet-VidVRD and VidOR, demonstrate the effectiveness of the proposed VRDFormer, which achieves state-of-the-art performance on both relation detection and relation tagging tasks.
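
The abstract's core idea is that static queries propose subject-object pairs in each frame while recurrent queries carry tracked pairs forward, so relation instances are generated frame by frame. The snippet below is a minimal conceptual sketch of that query mechanism, not the authors' implementation; all module and head names (PairDecoder, box_head, rel_head) and the toy matching rule are assumptions made for illustration.

```python
# Conceptual sketch (hypothetical names, standard PyTorch only) of combining
# static queries (detect new pairs per frame) with recurrent queries
# (track previously detected pairs across frames).
import torch
import torch.nn as nn


class PairDecoder(nn.Module):
    def __init__(self, d_model=256, num_static_queries=100, num_rel_classes=132):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        # Static queries: learned embeddings that look for new subject-object pairs.
        self.static_queries = nn.Embedding(num_static_queries, d_model)
        # Two boxes per query (subject + object), each as (cx, cy, w, h).
        self.box_head = nn.Linear(d_model, 8)
        self.rel_head = nn.Linear(d_model, num_rel_classes)

    def forward(self, frame_features, recurrent_queries=None):
        """frame_features: (B, HW, d_model) encoder output of one frame."""
        B = frame_features.size(0)
        static = self.static_queries.weight.unsqueeze(0).expand(B, -1, -1)
        # Recurrent queries are hidden states of pairs tracked from earlier frames.
        queries = static if recurrent_queries is None else torch.cat(
            [recurrent_queries, static], dim=1)
        hidden = self.decoder(queries, frame_features)
        boxes = self.box_head(hidden).sigmoid()   # subject/object boxes per query
        rel_logits = self.rel_head(hidden)        # relation class scores per pair
        return boxes, rel_logits, hidden


# Usage: process a short clip autoregressively, frame by frame.
model = PairDecoder()
recurrent = None
for t in range(4):
    feats = torch.randn(1, 196, 256)              # stand-in encoder features
    boxes, rels, hidden = model(feats, recurrent)
    recurrent = hidden[:, :10]                    # toy rule: keep a few tracked pairs
```

In the paper the selection of which hidden states become the next frame's recurrent queries is learned jointly with pair detection and relation classification; the slice above is only a placeholder for that matching step.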

Cite

Text

Zheng et al. "VRDFormer: End-to-End Video Visual Relation Detection with Transformers." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.01827

Markdown

[Zheng et al. "VRDFormer: End-to-End Video Visual Relation Detection with Transformers." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/zheng2022cvpr-vrdformer/) doi:10.1109/CVPR52688.2022.01827

BibTeX

@inproceedings{zheng2022cvpr-vrdformer,
  title     = {{VRDFormer: End-to-End Video Visual Relation Detection with Transformers}},
  author    = {Zheng, Sipeng and Chen, Shizhe and Jin, Qin},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2022},
  pages     = {18836-18846},
  doi       = {10.1109/CVPR52688.2022.01827},
  url       = {https://mlanthology.org/cvpr/2022/zheng2022cvpr-vrdformer/}
}