TB-HSU: Hierarchical 3D Scene Understanding with Contextual Affordances

Xu, Wenting; Ila, Viorela; Zhou, Luping; Jin, Craig T.

doi:10.1609/AAAI.V39I9.32969

AAAI 2025 pp. 8960-8968

Abstract

Function and affordance are critical aspects of 3D scene understanding and support task-oriented objectives. In this work, we develop a model that learns to structure and vary functional affordance across a 3D hierarchical scene graph (3DHSG) representing the spatial organization of a scene. The varying functional affordance is designed to integrate with the varying spatial context of the graph. More specifically, we develop an algorithm that learns to construct a 3DHSG from segmented object point clouds and object semantic labels. The graph has a top node that identifies the room label, child nodes that define local spatial regions inside the room with region-specific affordances, and grandchild nodes that indicate object locations and object-specific affordances. To support this work, we create a custom 3DHSG dataset that provides ground-truth local spatial regions with region-specific affordances, as well as object-specific affordances for each object. We employ a Transformer Based Hierarchical Scene Understanding (TB-HSU) model to learn the 3DHSG, trained in a multi-task learning framework that jointly learns room classification and the definition of spatial regions within the room with region-specific affordances. Our work improves on the performance of state-of-the-art baseline models and shows one approach for applying transformer models to 3D scene understanding and to the generation of 3DHSGs that capture the spatial organization of a room. The code and dataset are publicly available.
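The three-level structure described in the abstract (room, then regions, then objects) can be illustrated with a minimal Python sketch. This is not the authors' code or data format: the class names, field names, and example labels such as "bedroom" and "sleeping" are hypothetical, chosen only to show how a room node, region nodes with region-specific affordances, and object nodes with object-specific affordances might nest.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ObjectNode:
    """Grandchild node: one object, its location, and its own affordance."""
    label: str                             # object semantic label (hypothetical values below)
    centroid: Tuple[float, float, float]   # object location, e.g. point-cloud centroid
    affordance: str                        # object-specific affordance


@dataclass
class RegionNode:
    """Child node: a local spatial region with a region-specific affordance."""
    affordance: str
    objects: List[ObjectNode] = field(default_factory=list)


@dataclass
class RoomNode:
    """Top node: the room label, with regions as children."""
    label: str
    regions: List[RegionNode] = field(default_factory=list)

    def objects_in_regions_with(self, affordance: str) -> List[str]:
        """Return labels of objects that sit in regions with the given affordance."""
        return [
            obj.label
            for region in self.regions
            if region.affordance == affordance
            for obj in region.objects
        ]


# Hypothetical example scene; labels and coordinates are illustrative only.
room = RoomNode(
    label="bedroom",
    regions=[
        RegionNode(
            affordance="sleeping",
            objects=[
                ObjectNode("bed", (1.0, 2.0, 0.4), "lie on"),
                ObjectNode("pillow", (1.0, 2.6, 0.6), "rest head"),
            ],
        ),
        RegionNode(
            affordance="working",
            objects=[ObjectNode("desk", (3.2, 0.5, 0.7), "place items")],
        ),
    ],
)

sleeping_objects = room.objects_in_regions_with("sleeping")
```

In this sketch the same object could carry a different region-level affordance depending on which region it falls in, which is one way to read the abstract's point about affordance varying with spatial context.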

Cite

Text

Xu et al. "TB-HSU: Hierarchical 3D Scene Understanding with Contextual Affordances." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I9.32969

Markdown

[Xu et al. "TB-HSU: Hierarchical 3D Scene Understanding with Contextual Affordances." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/xu2025aaai-tb/) doi:10.1609/AAAI.V39I9.32969

BibTeX

@inproceedings{xu2025aaai-tb,
  title     = {{TB-HSU: Hierarchical 3D Scene Understanding with Contextual Affordances}},
  author    = {Xu, Wenting and Ila, Viorela and Zhou, Luping and Jin, Craig T.},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {8960--8968},
  doi       = {10.1609/AAAI.V39I9.32969},
  url       = {https://mlanthology.org/aaai/2025/xu2025aaai-tb/}
}