What to Look at and Where: Semantic and Spatial Refined Transformer for Detecting Human-Object Interactions

Abstract

We propose a novel one-stage Transformer-based Semantic and Spatial Refined Transformer (SSRT) to solve the Human-Object Interaction (HOI) detection task, which requires localizing humans and objects and predicting the interactions between them. Unlike previous Transformer-based HOI approaches, which mostly focus on improving the design of the decoder outputs for the final detection, SSRT introduces two new modules that help select the most relevant object-action pairs within an image and refine the queries' representation using rich semantic and spatial features. These enhancements lead to state-of-the-art results on the two most popular HOI benchmarks: V-COCO and HICO-DET.

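The abstract describes query refinement with semantic (label) and spatial (box-geometry) cues. Below is a minimal, hypothetical PyTorch sketch of that general idea, not the authors' implementation; module names, dimensions, and the cross-attention fusion scheme are assumptions for illustration only.

```python
# Illustrative sketch (not the paper's code): enrich decoder queries with
# semantic (object/action label embeddings) and spatial (box-geometry) cues.
# All names and dimensions below are assumptions.
import torch
import torch.nn as nn

class QueryRefiner(nn.Module):
    def __init__(self, d_model=256, num_objects=80, num_actions=117):
        super().__init__()
        # Semantic cues: embeddings of candidate object and action labels.
        self.obj_embed = nn.Embedding(num_objects, d_model)
        self.act_embed = nn.Embedding(num_actions, d_model)
        # Spatial cues: encode normalized human/object box coordinates (8 values).
        self.spatial_mlp = nn.Sequential(
            nn.Linear(8, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )
        # Queries attend to the fused semantic + spatial cues.
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, queries, obj_ids, act_ids, boxes):
        # queries: (B, Q, d); obj_ids, act_ids: (B, K); boxes: (B, K, 8)
        cues = self.obj_embed(obj_ids) + self.act_embed(act_ids) + self.spatial_mlp(boxes)
        refined, _ = self.cross_attn(queries, cues, cues)
        return self.norm(queries + refined)

# Toy usage with random inputs.
refiner = QueryRefiner()
q = torch.randn(2, 100, 256)          # decoder queries
obj = torch.randint(0, 80, (2, 16))   # selected object labels
act = torch.randint(0, 117, (2, 16))  # selected action labels
box = torch.rand(2, 16, 8)            # human + object box coordinates
out = refiner(q, obj, act, box)       # (2, 100, 256)
```
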
Cite

Text

Iftekhar et al. "What to Look at and Where: Semantic and Spatial Refined Transformer for Detecting Human-Object Interactions." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.00528

Markdown

[Iftekhar et al. "What to Look at and Where: Semantic and Spatial Refined Transformer for Detecting Human-Object Interactions." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/iftekhar2022cvpr-look/) doi:10.1109/CVPR52688.2022.00528

BibTeX

@inproceedings{iftekhar2022cvpr-look,
  title     = {{What to Look at and Where: Semantic and Spatial Refined Transformer for Detecting Human-Object Interactions}},
  author    = {Iftekhar, A S M and Chen, Hao and Kundu, Kaustav and Li, Xinyu and Tighe, Joseph and Modolo, Davide},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2022},
  pages     = {5353--5363},
  doi       = {10.1109/CVPR52688.2022.00528},
  url       = {https://mlanthology.org/cvpr/2022/iftekhar2022cvpr-look/}
}