Weakly-Supervised HOI Detection via Prior-Guided Bi-Level Representation Learning

Abstract

Human-object interaction (HOI) detection plays a crucial role in human-centric scene understanding and serves as a fundamental building block for many vision tasks. One generalizable and scalable strategy for HOI detection is to use weak supervision, learning from image-level annotations only. This is inherently challenging due to ambiguous human-object associations, the large search space for detecting HOIs, and a highly noisy training signal. A promising strategy to address these challenges is to exploit knowledge from large-scale pretrained models (e.g., CLIP), but a direct knowledge distillation strategy does not perform well in the weakly-supervised setting. In contrast, we develop a CLIP-guided HOI representation capable of incorporating the prior knowledge at both the image level and the HOI instance level, and adopt a self-taught mechanism to prune incorrect human-object associations. Experimental results on HICO-DET and V-COCO show that our method outperforms previous work by a sizable margin, demonstrating the efficacy of our HOI representation.

Cite

Text

Wan et al. "Weakly-Supervised HOI Detection via Prior-Guided Bi-Level Representation Learning." International Conference on Learning Representations, 2023.

Markdown

[Wan et al. "Weakly-Supervised HOI Detection via Prior-Guided Bi-Level Representation Learning." International Conference on Learning Representations, 2023.](https://mlanthology.org/iclr/2023/wan2023iclr-weaklysupervised/)

BibTeX

@inproceedings{wan2023iclr-weaklysupervised,
  title     = {{Weakly-Supervised HOI Detection via Prior-Guided Bi-Level Representation Learning}},
  author    = {Wan, Bo and Liu, Yongfei and Zhou, Desen and Tuytelaars, Tinne and He, Xuming},
  booktitle = {International Conference on Learning Representations},
  year      = {2023},
  url       = {https://mlanthology.org/iclr/2023/wan2023iclr-weaklysupervised/}
}