SGTR: End-to-End Scene Graph Generation with Transformer

Abstract

Scene Graph Generation (SGG) remains a challenging visual understanding task due to its compositional property. Most previous works adopt a bottom-up two-stage or a point-based one-stage approach, which often suffers from high time complexity or sub-optimal designs. In this work, we propose a novel SGG method to address these issues, formulating the task as a bipartite graph construction problem. To solve the problem, we develop a transformer-based end-to-end framework that first generates the entity and predicate proposal sets and then infers directed edges to form the relation triplets. In particular, we develop a new entity-aware predicate representation based on a structural predicate generator that leverages the compositional property of relationships. Moreover, we design a graph assembling module to infer the connectivity of the bipartite scene graph based on our entity-aware structure, enabling us to generate the scene graph in an end-to-end manner. Extensive experimental results show that our design achieves state-of-the-art or comparable performance on two challenging benchmarks, surpassing most existing approaches while offering higher inference efficiency. We hope our model can serve as a strong baseline for Transformer-based scene graph generation. Code is available at https://github.com/Scarecrow0/SGTR
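
The bipartite formulation can be illustrated with a toy sketch: entity proposals and predicate proposals form the two node sets, and each predicate is linked to one subject entity and one object entity to yield a triplet. This is not the SGTR implementation. The `Entity` and `Predicate` classes, the `similarity` score (label agreement minus box-center distance), and the top-1 matching rule are all assumptions made for illustration; the actual model works on learned transformer features.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Entity:
    label: str
    center: tuple  # (x, y) box center, normalized to [0, 1]


@dataclass
class Predicate:
    label: str
    subj_hint: Entity  # the predicate's own estimate of its subject (assumed form)
    obj_hint: Entity   # the predicate's own estimate of its object (assumed form)


def similarity(hint, entity):
    # Hypothetical score: 1 for matching labels, minus the center distance.
    label_score = 1.0 if hint.label == entity.label else 0.0
    dist = hypot(hint.center[0] - entity.center[0],
                 hint.center[1] - entity.center[1])
    return label_score - dist


def assemble(entities, predicates):
    # For each predicate node, add two directed edges: one to the best-matching
    # subject entity and one to the best-matching object entity.
    triplets = []
    indices = range(len(entities))
    for pred in predicates:
        subj = max(indices, key=lambda i: similarity(pred.subj_hint, entities[i]))
        obj = max(indices, key=lambda i: similarity(pred.obj_hint, entities[i]))
        triplets.append((subj, pred.label, obj))
    return triplets


entities = [
    Entity("person", (0.30, 0.40)),
    Entity("horse", (0.35, 0.65)),
    Entity("person", (0.80, 0.50)),
]
predicates = [
    Predicate("riding",
              Entity("person", (0.32, 0.42)),
              Entity("horse", (0.36, 0.60))),
]
triplets = assemble(entities, predicates)
print(triplets)  # [(0, 'riding', 1)]
```

Here the single predicate connects entity 0 (the nearer person) to entity 1 (the horse), so the second person proposal stays unlinked. Keeping only the top-1 match per role is the simplest choice for a sketch; a real system might keep several candidates per predicate and rank the resulting triplets.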

Cite

Text

Li et al. "SGTR: End-to-End Scene Graph Generation with Transformer." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.01888

Markdown

[Li et al. "SGTR: End-to-End Scene Graph Generation with Transformer." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/li2022cvpr-sgtr/) doi:10.1109/CVPR52688.2022.01888

BibTeX

@inproceedings{li2022cvpr-sgtr,
  title     = {{SGTR: End-to-End Scene Graph Generation with Transformer}},
  author    = {Li, Rongjie and Zhang, Songyang and He, Xuming},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2022},
  pages     = {19486--19496},
  doi       = {10.1109/CVPR52688.2022.01888},
  url       = {https://mlanthology.org/cvpr/2022/li2022cvpr-sgtr/}
}