Toward Open-Set Human Object Interaction Detection

Abstract

This work addresses the task of open-set Human Object Interaction (HOI) detection. The challenge lies in identifying completely new, out-of-domain relationships, as opposed to the in-domain ones on which zero-shot HOI detection has already improved. To address this challenge, we introduce a simple Disentangled HOI Detection (DHD) model that detects novel relationships by integrating an open-set object detector with a Visual Language Model (VLM). We train with a disentangled image-text contrastive learning objective and connect bottom-up visual features to text embeddings through lightweight unary and pair-wise adapters. Our model benefits from both the open-set object detector and the VLM: it can detect novel action categories and combine actions with novel object categories. We further present the VG-HOI dataset, a comprehensive benchmark with over 17k HOI relationships for open-set scenarios. Experimental results show that our model can detect unknown action classes and combine actions with unknown object classes. Furthermore, it generalizes to over 17k HOI classes while being trained on just 600 HOI classes.

Cite

Text

Wu et al. "Toward Open-Set Human Object Interaction Detection." AAAI Conference on Artificial Intelligence, 2024. doi:10.1609/AAAI.V38I6.28422

Markdown

[Wu et al. "Toward Open-Set Human Object Interaction Detection." AAAI Conference on Artificial Intelligence, 2024.](https://mlanthology.org/aaai/2024/wu2024aaai-open/) doi:10.1609/AAAI.V38I6.28422

BibTeX

@inproceedings{wu2024aaai-open,
  title     = {{Toward Open-Set Human Object Interaction Detection}},
  author    = {Wu, Mingrui and Liu, Yuqi and Ji, Jiayi and Sun, Xiaoshuai and Ji, Rongrong},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2024},
  pages     = {6066--6073},
  doi       = {10.1609/AAAI.V38I6.28422},
  url       = {https://mlanthology.org/aaai/2024/wu2024aaai-open/}
}