Open-World Human-Object Interaction Detection via Multi-Modal Prompts

Abstract

In this paper, we develop MP-HOI, a powerful Multi-modal Prompt-based HOI detector designed to leverage both textual descriptions for open-set generalization and visual exemplars for handling highly ambiguous descriptions, realizing HOI detection in the open world. Specifically, it integrates visual prompts into existing language-guided-only HOI detectors to handle situations where textual descriptions struggle to generalize and to address complex scenarios with high interaction ambiguity. To facilitate MP-HOI training, we build a large-scale HOI dataset named Magic-HOI, which gathers six existing datasets into a unified label space, forming over 186K images with 2.4K objects, 1.2K actions, and 20K HOI interactions. Furthermore, to tackle the long-tail issue within the Magic-HOI dataset, we introduce an automated pipeline for generating realistically annotated HOI images and present SynHOI, a high-quality synthetic HOI dataset containing 100K images. Leveraging these two datasets, MP-HOI optimizes the HOI task as a similarity-learning process between multi-modal prompts and objects/interactions via a unified contrastive loss, learning generalizable and transferable object/interaction representations from large-scale data. MP-HOI can serve as a generalist HOI detector, surpassing the HOI vocabulary of existing expert models by more than 30 times. Concurrently, our results demonstrate that MP-HOI exhibits remarkable zero-shot capability in real-world scenarios and consistently achieves new state-of-the-art performance across various benchmarks. Our project homepage is available at https://MP-HOI.github.io/.
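The abstract frames HOI detection as similarity learning between multi-modal prompt embeddings and visual object/interaction embeddings under a unified contrastive loss. The following is a minimal illustrative sketch of such a symmetric InfoNCE-style contrastive objective; the function name, temperature value, and exact formulation are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def contrastive_hoi_loss(visual_feats, prompt_feats, temperature=0.07):
    """Illustrative symmetric contrastive loss: the i-th visual embedding
    is the positive match for the i-th multi-modal prompt embedding.
    NOTE: a sketch of the general technique, not MP-HOI's released code.
    """
    # L2-normalize so dot products are cosine similarities
    v = visual_feats / np.linalg.norm(visual_feats, axis=1, keepdims=True)
    p = prompt_feats / np.linalg.norm(prompt_feats, axis=1, keepdims=True)
    logits = v @ p.T / temperature  # (N, N) similarity matrix

    def cross_entropy_on_diagonal(lg):
        # softmax cross-entropy with matched pairs on the diagonal
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        n = lg.shape[0]
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # average both directions: visual-to-prompt and prompt-to-visual
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))
```

Under this kind of objective, matched prompt/visual pairs are pulled together and mismatched pairs pushed apart, which is what allows a single shared embedding space to cover both textual and visual prompts.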

Cite

Text

Yang et al. "Open-World Human-Object Interaction Detection via Multi-Modal Prompts." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01604

Markdown

[Yang et al. "Open-World Human-Object Interaction Detection via Multi-Modal Prompts." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/yang2024cvpr-openworld/) doi:10.1109/CVPR52733.2024.01604

BibTeX

@inproceedings{yang2024cvpr-openworld,
  title     = {{Open-World Human-Object Interaction Detection via Multi-Modal Prompts}},
  author    = {Yang, Jie and Li, Bingliang and Zeng, Ailing and Zhang, Lei and Zhang, Ruimao},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {16954--16964},
  doi       = {10.1109/CVPR52733.2024.01604},
  url       = {https://mlanthology.org/cvpr/2024/yang2024cvpr-openworld/}
}