Detecting Human-Object Relationships in Videos
Abstract
We study a crucial problem in video analysis: human-object relationship detection. The majority of previous approaches are developed only for the static image scenario, without incorporating the temporal dynamics so vital to contextualizing human-object relationships. We propose a model with Intra- and Inter-Transformers, enabling joint spatial and temporal reasoning on multiple visual concepts of objects, relationships, and human poses. We find that applying attention mechanisms among features distributed spatio-temporally greatly improves our understanding of human-object relationships. Our method is validated on two datasets, Action Genome and CAD-120-EVAR, and achieves state-of-the-art performance on both of them.
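The abstract describes attention applied first among visual concepts within a frame and then across frames. A minimal numpy sketch of that two-stage attention pattern is below; the tensor shapes, token layout, and single-head scaled dot-product attention are illustrative assumptions, not the paper's actual Intra-/Inter-Transformer implementation.

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d)) V
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v

# Hypothetical toy features: T frames, N concept tokens per frame
# (e.g. object, relationship, pose embeddings), each d-dimensional.
T, N, d = 4, 3, 8
rng = np.random.default_rng(0)
x = rng.normal(size=(T, N, d))

# Intra-frame step (sketch): attend among the N concepts within each frame.
intra = attention(x, x, x)  # shape (T, N, d)

# Inter-frame step (sketch): attend across the T frames for each concept.
inter = attention(intra.swapaxes(0, 1),
                  intra.swapaxes(0, 1),
                  intra.swapaxes(0, 1)).swapaxes(0, 1)  # back to (T, N, d)

print(inter.shape)  # (4, 3, 8)
```

The point of the sketch is only the factorization: spatial (within-frame) reasoning and temporal (cross-frame) reasoning are separate attention passes over the same token grid.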
Cite
Text
Ji et al. "Detecting Human-Object Relationships in Videos." International Conference on Computer Vision, 2021. doi:10.1109/ICCV48922.2021.00800

Markdown

[Ji et al. "Detecting Human-Object Relationships in Videos." International Conference on Computer Vision, 2021.](https://mlanthology.org/iccv/2021/ji2021iccv-detecting/) doi:10.1109/ICCV48922.2021.00800

BibTeX
@inproceedings{ji2021iccv-detecting,
  title = {{Detecting Human-Object Relationships in Videos}},
  author = {Ji, Jingwei and Desai, Rishi and Niebles, Juan Carlos},
  booktitle = {International Conference on Computer Vision},
  year = {2021},
  pages = {8106-8116},
  doi = {10.1109/ICCV48922.2021.00800},
  url = {https://mlanthology.org/iccv/2021/ji2021iccv-detecting/}
}