Detecting Human-Object Relationships in Videos

Abstract

We study a crucial problem in video analysis: human-object relationship detection. Most previous approaches were developed only for static images, without incorporating the temporal dynamics that are vital to contextualizing human-object relationships. We propose a model with Intra- and Inter-Transformers, enabling joint spatial and temporal reasoning over multiple visual concepts: objects, relationships, and human poses. We find that applying attention mechanisms to features distributed spatio-temporally substantially improves human-object relationship detection. Our method is validated on two datasets, Action Genome and CAD-120-EVAR, and achieves state-of-the-art performance on both.
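The abstract describes factorized spatial and temporal reasoning: attention among entities within a frame (intra-frame) followed by attention across frames (inter-frame). The following is a minimal, dependency-free sketch of that idea using scaled dot-product attention; all function names and the feature layout (`video_feats[t][n]` holding frame `t`, entity slot `n`) are illustrative assumptions, not the paper's actual implementation, which uses full Transformer blocks with learned projections.

```python
import math

def attention(queries, keys, values):
    """Plain scaled dot-product attention over lists of feature vectors
    (no learned projections; illustration only)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(scores)  # subtract max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(d)])
    return out

def intra_inter_attention(video_feats):
    """video_feats: [T][N][d] per-frame entity features (objects, poses, ...).
    Intra step: attend among entities within each frame (spatial reasoning).
    Inter step: attend across frames for each entity slot (temporal reasoning)."""
    # Intra-frame attention, applied independently per frame.
    intra = [attention(frame, frame, frame) for frame in video_feats]
    # Inter-frame attention, applied per entity slot across time.
    T, N = len(intra), len(intra[0])
    out = [[None] * N for _ in range(T)]
    for n in range(N):
        track = [intra[t][n] for t in range(T)]
        refined = attention(track, track, track)
        for t in range(T):
            out[t][n] = refined[t]
    return out
```

With identical inputs the attention weights are uniform, so the output is the mean of the values; in the full model, learned query/key/value projections let different entities and frames contribute unequally.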

Cite

Text

Ji et al. "Detecting Human-Object Relationships in Videos." International Conference on Computer Vision, 2021. doi:10.1109/ICCV48922.2021.00800

Markdown

[Ji et al. "Detecting Human-Object Relationships in Videos." International Conference on Computer Vision, 2021.](https://mlanthology.org/iccv/2021/ji2021iccv-detecting/) doi:10.1109/ICCV48922.2021.00800

BibTeX

@inproceedings{ji2021iccv-detecting,
  title     = {{Detecting Human-Object Relationships in Videos}},
  author    = {Ji, Jingwei and Desai, Rishi and Niebles, Juan Carlos},
  booktitle = {International Conference on Computer Vision},
  year      = {2021},
  pages     = {8106--8116},
  doi       = {10.1109/ICCV48922.2021.00800},
  url       = {https://mlanthology.org/iccv/2021/ji2021iccv-detecting/}
}