Skew-Robust Human-Object Interactions in Videos
Abstract
Humans are arguably among the most important regions of interest in a visual analysis pipeline. Detecting how a human interacts with the surrounding environment is therefore an important problem with several potential use cases. While this problem has been adequately addressed in the literature for images, very few methods address in-the-wild videos. The problem is further exacerbated by a high degree of label skew. To this end, we propose SeRVo-HOI, a robust end-to-end framework for recognizing human-object interactions in video, particularly in high label-skew settings. The network contextualizes multiple image representations and is trained to explicitly handle dataset skew. We propose and analyse methods to address the long-tail distribution of the labels and show improvements on the tail labels. SeRVo-HOI outperforms the state of the art by a significant margin (21.1% vs. 17.6% mAP) on the large-scale, in-the-wild VidHOI dataset, while demonstrating solid improvements on the tail classes in particular (19.9% vs. 17.3% mAP).
Cite
Text
Agarwal et al. "Skew-Robust Human-Object Interactions in Videos." Winter Conference on Applications of Computer Vision, 2023.
Markdown
[Agarwal et al. "Skew-Robust Human-Object Interactions in Videos." Winter Conference on Applications of Computer Vision, 2023.](https://mlanthology.org/wacv/2023/agarwal2023wacv-skewrobust/)
BibTeX
@inproceedings{agarwal2023wacv-skewrobust,
title = {{Skew-Robust Human-Object Interactions in Videos}},
author = {Agarwal, Apoorva and Dabral, Rishabh and Jain, Arjun and Ramakrishnan, Ganesh},
booktitle = {Winter Conference on Applications of Computer Vision},
year = {2023},
pages = {5098--5107},
url = {https://mlanthology.org/wacv/2023/agarwal2023wacv-skewrobust/}
}