Multi-Object 3D Grounding with Dynamic Modules and Language-Informed Spatial Attention

Abstract

Multi-object 3D grounding involves locating 3D boxes in a point cloud based on a given query phrase. It is a challenging and important task with numerous applications in visual understanding, human-computer interaction, and robotics. To tackle this challenge, we introduce D-LISA, a two-stage approach that incorporates three innovations. First, a dynamic vision module enables a variable and learnable number of box proposals. Second, dynamic camera positioning extracts features for each proposal. Third, a language-informed spatial attention module reasons over the proposals to produce the final prediction. Empirically, our method outperforms state-of-the-art methods on multi-object 3D grounding by 12.8% (absolute) and is competitive on single-object 3D grounding.
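
The paper itself is not reproduced on this page, so the following is only a minimal, hypothetical PyTorch sketch of what a language-informed spatial attention block over box proposals could look like. The class name, feature dimensions, and the choice of pairwise spatial relations (center offsets plus distance) are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class LanguageInformedSpatialAttention(nn.Module):
    """Sketch: attention over box proposals whose weights are biased by
    pairwise spatial relations, modulated by the language query."""

    def __init__(self, feat_dim: int = 256, lang_dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = feat_dim // num_heads
        self.q_proj = nn.Linear(feat_dim, feat_dim)
        self.k_proj = nn.Linear(feat_dim, feat_dim)
        self.v_proj = nn.Linear(feat_dim, feat_dim)
        # Maps pairwise spatial relations (center offsets + distance),
        # concatenated with the language feature, to a per-head attention bias.
        self.spatial_bias = nn.Sequential(
            nn.Linear(4 + lang_dim, 64), nn.ReLU(), nn.Linear(64, num_heads)
        )
        self.out_proj = nn.Linear(feat_dim, feat_dim)

    def forward(self, box_feats, box_centers, lang_feat):
        # box_feats: (B, N, feat_dim), box_centers: (B, N, 3), lang_feat: (B, lang_dim)
        B, N, D = box_feats.shape
        q = self.q_proj(box_feats).view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(box_feats).view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(box_feats).view(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        # Pairwise spatial relations between proposal centers.
        offsets = box_centers[:, :, None, :] - box_centers[:, None, :, :]   # (B, N, N, 3)
        dists = offsets.norm(dim=-1, keepdim=True)                          # (B, N, N, 1)
        rel = torch.cat([offsets, dists], dim=-1)                           # (B, N, N, 4)
        lang = lang_feat[:, None, None, :].expand(B, N, N, -1)              # (B, N, N, lang_dim)
        bias = self.spatial_bias(torch.cat([rel, lang], dim=-1))            # (B, N, N, heads)
        bias = bias.permute(0, 3, 1, 2)                                     # (B, heads, N, N)

        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5 + bias
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, D)
        return self.out_proj(out)

# Shape-level usage on random proposals (not the authors' configuration).
if __name__ == "__main__":
    module = LanguageInformedSpatialAttention()
    feats = torch.randn(2, 32, 256)    # 32 box proposals per scene
    centers = torch.randn(2, 32, 3)
    lang = torch.randn(2, 256)
    print(module(feats, centers, lang).shape)  # torch.Size([2, 32, 256])

The key design idea this sketch illustrates is that the attention weights among proposals are not purely appearance-based: a learned additive bias computed from relative geometry and the query lets language cues such as "next to" or "behind" reshape how proposals attend to one another.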

Cite

Text

Zhang et al. "Multi-Object 3D Grounding with Dynamic Modules and Language-Informed Spatial Attention." Neural Information Processing Systems, 2024. doi:10.52202/079017-3917

Markdown

[Zhang et al. "Multi-Object 3D Grounding with Dynamic Modules and Language-Informed Spatial Attention." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/zhang2024neurips-multiobject/) doi:10.52202/079017-3917

BibTeX

@inproceedings{zhang2024neurips-multiobject,
  title     = {{Multi-Object 3D Grounding with Dynamic Modules and Language-Informed Spatial Attention}},
  author    = {Zhang, Haomeng and Yang, Chiao-An and Yeh, Raymond A.},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-3917},
  url       = {https://mlanthology.org/neurips/2024/zhang2024neurips-multiobject/}
}