F-HOI: Toward Fine-Grained Semantic-Aligned 3D Human-Object Interactions

Abstract

Existing 3D human-object interaction (HOI) datasets and models simply align global descriptions with the long HOI sequence, while lacking a detailed understanding of intermediate states and the transitions between states. In this paper, we argue that fine-grained semantic alignment, which utilizes state-level descriptions, offers a promising paradigm for learning semantically rich HOI representations. To achieve this, we introduce Semantic-HOI, a new dataset comprising over 20K paired HOI states with fine-grained descriptions for each HOI state and the body movements that happen between two consecutive states. Leveraging the proposed dataset, we design three state-level HOI tasks to accomplish fine-grained semantic alignment within the HOI sequence. Additionally, we propose a unified model called F-HOI, designed to leverage multimodal instructions and empower the Multi-modal Large Language Model to efficiently handle diverse HOI tasks. F-HOI offers multiple advantages: (1) It employs a unified task formulation that supports the use of versatile multimodal inputs. (2) It maintains consistency in HOI across 2D, 3D, and linguistic spaces. (3) It utilizes fine-grained textual supervision for direct optimization, avoiding intricate modeling of HOI states. Extensive experiments reveal that F-HOI effectively aligns HOI states with fine-grained semantic descriptions, adeptly tackling understanding, reasoning, generation, and reconstruction tasks.

Cite

Text

Yang et al. "F-HOI: Toward Fine-Grained Semantic-Aligned 3D Human-Object Interactions." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72913-3_6

Markdown

[Yang et al. "F-HOI: Toward Fine-Grained Semantic-Aligned 3D Human-Object Interactions." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/yang2024eccv-fhoi/) doi:10.1007/978-3-031-72913-3_6

BibTeX

@inproceedings{yang2024eccv-fhoi,
  title     = {{F-HOI: Toward Fine-Grained Semantic-Aligned 3D Human-Object Interactions}},
  author    = {Yang, Jie and Niu, Xuesong and Jiang, Nan and Zhang, Ruimao and Huang, Siyuan},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-72913-3_6},
  url       = {https://mlanthology.org/eccv/2024/yang2024eccv-fhoi/}
}