End-to-End 3D Dense Captioning with Vote2Cap-DETR
Abstract
3D dense captioning aims to generate multiple captions localized with their associated object regions. Existing methods follow a sophisticated "detect-then-describe" pipeline equipped with numerous hand-crafted components. However, these hand-crafted components yield suboptimal performance given the cluttered spatial and class distributions of objects across different scenes. In this paper, we propose a simple yet effective transformer framework, Vote2Cap-DETR, based on the recently popular DEtection TRansformer (DETR). Compared with prior art, our framework has several appealing advantages: 1) Without resorting to numerous hand-crafted components, our method is based on a full transformer encoder-decoder architecture with a learnable vote-query-driven object decoder and a caption decoder that produces dense captions in a set-prediction manner. 2) In contrast to two-stage schemes, our method performs detection and captioning in a single stage. 3) Without bells and whistles, extensive experiments on two commonly used datasets, ScanRefer and Nr3D, demonstrate that Vote2Cap-DETR surpasses the current state of the art by 11.13% and 7.11% in CIDEr@0.5IoU, respectively. Code will be released soon.
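The abstract compresses the whole pipeline into one sentence, so a minimal sketch may help make the one-stage flow concrete: an encoder produces per-seed scene tokens, vote queries (seed points shifted by a learned offset toward object centers) drive a DETR-style decoder, and box and caption heads read the same query features in parallel, so detection and captioning happen in a single forward pass. The PyTorch sketch below is written under stated assumptions; every module name, shape, and hyperparameter is illustrative and not taken from the authors' released implementation.

import torch
import torch.nn as nn

class Vote2CapDETRSketch(nn.Module):
    """Illustrative one-stage detect-and-describe skeleton (not the official model)."""

    def __init__(self, d_model=256, vocab_size=4000):
        super().__init__()
        # Transformer encoder over per-seed point features.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=3)
        # Vote query: a learned 3D offset shifting each seed toward an object center.
        self.vote_offset = nn.Linear(d_model, 3)
        # DETR-style decoder: queries attend to the encoded scene tokens.
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=8)
        # Parallel heads: one box and one caption per query (set prediction).
        self.box_head = nn.Linear(d_model, 7)               # center (3) + size (3) + objectness (1)
        self.caption_head = nn.Linear(d_model, vocab_size)  # stand-in for the full caption decoder

    def forward(self, seed_xyz, seed_feats):
        # seed_xyz: (B, N, 3) seed coordinates; seed_feats: (B, N, d_model) seed features.
        tokens = self.encoder(seed_feats)
        vote_xyz = seed_xyz + self.vote_offset(tokens)  # shifted seeds: spatial part of the vote queries
        # Simplification: reuse the encoded seed features as query content; the real
        # model would also aggregate features around each shifted vote_xyz location.
        queries = self.decoder(tokens, tokens)
        boxes = self.box_head(queries)        # (B, N, 7) one box proposal per query
        captions = self.caption_head(queries) # (B, N, vocab) per-query token logits, greatly simplified
        return vote_xyz, boxes, captions

# Tiny smoke test with random data.
model = Vote2CapDETRSketch()
xyz, feats = torch.rand(2, 128, 3), torch.rand(2, 128, 256)
votes, boxes, captions = model(xyz, feats)

Training would then match the N predicted (box, caption) pairs against the ground truth with Hungarian-style set matching, as in DETR; that loss machinery is omitted from the sketch.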
Cite
Text
Chen et al. "End-to-End 3D Dense Captioning with Vote2Cap-DETR." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.01070Markdown
[Chen et al. "End-to-End 3D Dense Captioning with Vote2Cap-DETR." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/chen2023cvpr-endtoend/) doi:10.1109/CVPR52729.2023.01070BibTeX
@inproceedings{chen2023cvpr-endtoend,
title = {{End-to-End 3D Dense Captioning with Vote2Cap-DETR}},
author = {Chen, Sijin and Zhu, Hongyuan and Chen, Xin and Lei, Yinjie and Yu, Gang and Chen, Tao},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2023},
pages = {11124--11133},
doi = {10.1109/CVPR52729.2023.01070},
url = {https://mlanthology.org/cvpr/2023/chen2023cvpr-endtoend/}
}