Relationship Proposal Networks

Abstract

Image scene understanding requires learning the relationships between objects in the scene. A scene with many objects may have only a few that actually interact (e.g., in a party image with many people, only a handful might be speaking with each other). To detect all relationships, it would be inefficient to first detect all individual objects and then classify all pairs: not only is the number of pairs quadratic in the number of objects, but classification requires a limited set of object categories, which does not scale to real-world images. In this paper we address these challenges by using pairs of related regions in images to train a relationship proposer that, at test time, produces a manageable number of related regions. We name our model the Relationship Proposal Network (Rel-PN). Like object proposals, our Rel-PN is class-agnostic and thus scalable to an open vocabulary of objects. We demonstrate the ability of our Rel-PN to localize relationships with only a few thousand proposals, evaluate its performance on the Visual Genome dataset, and compare it to baselines that we designed. We also conduct experiments on a smaller subset of 5,000 images with over 37,000 related regions and show promising results.
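
To make the abstract's core idea concrete, below is a minimal sketch (not the authors' implementation) of class-agnostic pair proposal: rather than classifying every one of the quadratically many box pairs against a fixed category set, score each pair with a class-agnostic relatedness function and keep only a manageable top-K, each with its union region. All names here (`score_relatedness`, `union_box`, `top_k_pairs`) and the center-distance heuristic standing in for the learned scorer are assumptions for illustration.

```python
# Hypothetical sketch of class-agnostic relationship proposal pruning.
# Only numpy is used; no claim is made about the paper's architecture.
import itertools
import numpy as np

def union_box(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Tightest box enclosing two [x1, y1, x2, y2] boxes."""
    return np.array([min(a[0], b[0]), min(a[1], b[1]),
                     max(a[2], b[2]), max(a[3], b[3])])

def score_relatedness(a: np.ndarray, b: np.ndarray) -> float:
    """Stand-in for a learned class-agnostic pair scorer.
    Here: a simple proximity heuristic on box centers (assumption)."""
    ca = np.array([(a[0] + a[2]) / 2, (a[1] + a[3]) / 2])
    cb = np.array([(b[0] + b[2]) / 2, (b[1] + b[3]) / 2])
    return float(1.0 / (1.0 + np.linalg.norm(ca - cb)))

def top_k_pairs(boxes: np.ndarray, k: int):
    """Score every ordered pair of boxes and return the k highest-scoring
    ones with their union regions. This is where the quadratic candidate
    set is cut down to a manageable number of proposals."""
    scored = []
    for i, j in itertools.permutations(range(len(boxes)), 2):
        s = score_relatedness(boxes[i], boxes[j])
        scored.append((s, i, j, union_box(boxes[i], boxes[j])))
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:k]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    xy = rng.uniform(0, 400, size=(20, 2))
    wh = rng.uniform(20, 80, size=(20, 2))
    boxes = np.hstack([xy, xy + wh])      # 20 candidate boxes, [x1,y1,x2,y2]
    proposals = top_k_pairs(boxes, k=10)  # keep only 10 of 380 ordered pairs
    for s, i, j, u in proposals[:3]:
        print(f"pair ({i},{j}) score={s:.3f} union={u.round(1)}")
```

In the paper the relatedness score comes from a trained proposal network rather than a heuristic; the sketch only shows where pruning happens and why a class-agnostic score, unlike pairwise classification over a closed category set, needs no fixed object vocabulary.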

Cite

Text

Zhang et al. "Relationship Proposal Networks." Conference on Computer Vision and Pattern Recognition, 2017. doi:10.1109/CVPR.2017.555

Markdown

[Zhang et al. "Relationship Proposal Networks." Conference on Computer Vision and Pattern Recognition, 2017.](https://mlanthology.org/cvpr/2017/zhang2017cvpr-relationship/) doi:10.1109/CVPR.2017.555

BibTeX

@inproceedings{zhang2017cvpr-relationship,
  title     = {{Relationship Proposal Networks}},
  author    = {Zhang, Ji and Elhoseiny, Mohamed and Cohen, Scott and Chang, Walter and Elgammal, Ahmed},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2017},
  doi       = {10.1109/CVPR.2017.555},
  url       = {https://mlanthology.org/cvpr/2017/zhang2017cvpr-relationship/}
}