What to Look at and Where: Semantic and Spatial Refined Transformer for Detecting Human-Object Interactions
Abstract
We propose a novel one-stage Transformer-based Semantic and Spatial Refined Transformer (SSRT) to solve the Human-Object Interaction detection task, which requires localizing humans and objects and predicting their interactions. Unlike previous Transformer-based HOI approaches, which mostly focus on improving the design of the decoder outputs for the final detection, SSRT introduces two new modules to help select the most relevant object-action pairs within an image and refine the queries' representation using rich semantic and spatial features. These enhancements lead to state-of-the-art results on the two most popular HOI benchmarks: V-COCO and HICO-DET.
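To make the abstract's idea concrete, below is a minimal PyTorch-style sketch of a cross-attention query-refinement step in the spirit of what the abstract describes: decoder queries attend to semantic and spatial support features. The class name (QueryRefiner), dimensions, and layer choices are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn

class QueryRefiner(nn.Module):
    # Hypothetical sketch: refine HOI decoder queries ("what to look at")
    # with semantic/spatial support features ("where") via cross-attention.
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, queries, support):
        # queries: (B, num_queries, d_model) HOI decoder queries
        # support: (B, num_support, d_model) semantic + spatial support features
        attn_out, _ = self.cross_attn(queries, support, support)
        x = self.norm1(queries + attn_out)
        x = self.norm2(x + self.ffn(x))
        return x  # refined queries, same shape as input queries

# Toy usage with random tensors (shapes are assumptions for illustration)
refiner = QueryRefiner()
q = torch.randn(2, 100, 256)   # 100 queries per image
s = torch.randn(2, 16, 256)    # 16 support features per image
refined = refiner(q, s)        # (2, 100, 256)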
Cite
Text
Iftekhar et al. "What to Look at and Where: Semantic and Spatial Refined Transformer for Detecting Human-Object Interactions." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.00528
Markdown
[Iftekhar et al. "What to Look at and Where: Semantic and Spatial Refined Transformer for Detecting Human-Object Interactions." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/iftekhar2022cvpr-look/) doi:10.1109/CVPR52688.2022.00528
BibTeX
@inproceedings{iftekhar2022cvpr-look,
title = {{What to Look at and Where: Semantic and Spatial Refined Transformer for Detecting Human-Object Interactions}},
author = {Iftekhar, A S M and Chen, Hao and Kundu, Kaustav and Li, Xinyu and Tighe, Joseph and Modolo, Davide},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2022},
pages = {5353-5363},
doi = {10.1109/CVPR52688.2022.00528},
url = {https://mlanthology.org/cvpr/2022/iftekhar2022cvpr-look/}
}