Scan2Cap: Context-Aware Dense Captioning in RGB-D Scans

Abstract

We introduce the new task of dense captioning in RGB-D scans. As input, we assume a point cloud of a 3D scene; the expected output is a set of bounding boxes together with descriptions of the underlying objects. To address 3D object detection and description jointly, we propose Scan2Cap, an end-to-end trained architecture that detects objects in the input scene and generates natural-language descriptions for all of them. We apply an attention-based captioning method to generate descriptive tokens while referring to the related components in the local context. To better handle the relative spatial relations between objects, a message-passing graph module learns relation features, which are later used in the captioning phase. On the recently proposed ScanRefer dataset, we show that our architecture can effectively localize and describe the 3D objects in the scene. It also outperforms 2D-based methods on the 3D dense captioning task by a significant margin.
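
The abstract outlines two mechanisms: a message-passing graph module that refines per-object features with pairwise spatial relations, and an attention-based decoder that consults those relation-enhanced context features at each captioning step. The sketch below illustrates these two ideas in PyTorch; all module names, feature dimensions, and aggregation choices are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of relation-aware message passing and attentive caption
# decoding; names, dimensions, and details are illustrative only and are not
# taken from the Scan2Cap paper.
import torch
import torch.nn as nn


class RelationalMessagePassing(nn.Module):
    """One round of message passing over a fully connected object graph."""

    def __init__(self, feat_dim=128):
        super().__init__()
        # Edge features are built from the two endpoint features and the
        # relative offset between the object centers.
        self.edge_mlp = nn.Sequential(
            nn.Linear(2 * feat_dim + 3, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim))
        self.node_mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU())

    def forward(self, obj_feats, obj_centers):
        # obj_feats: (N, D) per-object features, obj_centers: (N, 3) box centers
        n = obj_feats.size(0)
        src = obj_feats.unsqueeze(1).expand(n, n, -1)   # sender features
        dst = obj_feats.unsqueeze(0).expand(n, n, -1)   # receiver features
        offset = obj_centers.unsqueeze(1) - obj_centers.unsqueeze(0)
        messages = self.edge_mlp(torch.cat([src, dst, offset], dim=-1))
        agg = messages.mean(dim=0)                      # aggregate incoming messages
        return self.node_mlp(torch.cat([obj_feats, agg], dim=-1))


class AttentiveCaptionStep(nn.Module):
    """Single decoding step attending over relation-enhanced object features."""

    def __init__(self, feat_dim=128, hidden_dim=256, vocab_size=1000):
        super().__init__()
        self.attn = nn.Linear(feat_dim + hidden_dim, 1)
        self.rnn = nn.GRUCell(feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context_feats, hidden):
        # context_feats: (N, D) object features, hidden: (H,) decoder state
        scores = self.attn(torch.cat(
            [context_feats, hidden.expand(context_feats.size(0), -1)], dim=-1))
        weights = torch.softmax(scores, dim=0)          # attention over objects
        attended = (weights * context_feats).sum(dim=0)  # weighted context vector
        hidden = self.rnn(attended.unsqueeze(0), hidden.unsqueeze(0)).squeeze(0)
        return self.out(hidden), hidden                  # token logits, next state
```

In this toy setup, the relation-enhanced features produced by `RelationalMessagePassing` would serve as the context that `AttentiveCaptionStep` attends over when emitting each token of an object's description.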

Cite

Text

Chen et al. "Scan2Cap: Context-Aware Dense Captioning in RGB-D Scans." Conference on Computer Vision and Pattern Recognition, 2021. doi:10.1109/CVPR46437.2021.00321

Markdown

[Chen et al. "Scan2Cap: Context-Aware Dense Captioning in RGB-D Scans." Conference on Computer Vision and Pattern Recognition, 2021.](https://mlanthology.org/cvpr/2021/chen2021cvpr-scan2cap/) doi:10.1109/CVPR46437.2021.00321

BibTeX

@inproceedings{chen2021cvpr-scan2cap,
  title     = {{Scan2Cap: Context-Aware Dense Captioning in RGB-D Scans}},
  author    = {Chen, Zhenyu and Gholami, Ali and Niessner, Matthias and Chang, Angel X.},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2021},
  pages     = {3193--3203},
  doi       = {10.1109/CVPR46437.2021.00321},
  url       = {https://mlanthology.org/cvpr/2021/chen2021cvpr-scan2cap/}
}