Mask-Attention-Free Transformer for 3D Instance Segmentation

Abstract

Recently, transformer-based methods have dominated 3D instance segmentation, where mask attention is commonly involved. Specifically, object queries are guided by the initial instance masks in the first cross-attention, and then iteratively refine themselves in a similar manner. However, we observe that the mask-attention pipeline usually leads to slow convergence due to low-recall initial instance masks. Therefore, we abandon the mask attention design and resort to an auxiliary center regression task instead. Through center regression, we effectively overcome the low-recall issue and perform cross-attention by imposing a positional prior. To reach this goal, we develop a series of position-aware designs. First, we learn a spatial distribution of 3D locations as the initial position queries. They spread densely over the 3D space, and thus can easily capture the objects in a scene with high recall. Moreover, we present relative position encoding for the cross-attention and iterative refinement for more accurate position queries. Experiments show that our approach converges 4x faster than existing work, sets a new state of the art on the ScanNetv2 3D instance segmentation benchmark, and also demonstrates superior performance across various datasets. Code and models are available at https://github.com/dvlab-research/Mask-Attention-Free-Transformer.
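
To make the idea of cross-attention with a positional prior more concrete, the following is a minimal NumPy sketch and not the authors' implementation (see the linked repository for that). It assumes each query carries a content vector and a 3D position, and it stands in for the paper's relative position encoding with a simple squared-distance penalty on the attention logits. The function name `position_aware_cross_attention`, the `scale` parameter, and the centroid-based position update are all hypothetical choices made for illustration. Mask attention, by contrast, would restrict the logits with a binary instance mask rather than a soft distance bias.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def position_aware_cross_attention(content_q, pos_q, feats, coords, scale=1.0):
    """One illustrative cross-attention step with a positional prior.

    content_q: (Q, C) content queries
    pos_q:     (Q, 3) position queries (assumed to be 3D locations)
    feats:     (N, C) point or superpoint features
    coords:    (N, 3) point or superpoint coordinates
    scale:     hypothetical strength of the distance penalty
    """
    channels = content_q.shape[1]
    logits = content_q @ feats.T / np.sqrt(channels)      # (Q, N)

    # Assumed positional prior: penalize squared distance between each
    # query position and each point. The paper uses a relative position
    # encoding; this distance bias is only a simplified stand-in.
    rel = pos_q[:, None, :] - coords[None, :, :]          # (Q, N, 3)
    bias = -scale * (rel ** 2).sum(axis=-1)               # (Q, N)

    attn = softmax(logits + bias, axis=-1)                # rows sum to 1
    new_content = attn @ feats                            # (Q, C)

    # Stand-in for iterative refinement: move each position query to the
    # attention-weighted centroid of the points it attends to.
    new_pos = attn @ coords                               # (Q, 3)
    return new_content, new_pos, attn


# Toy usage with random data.
rng = np.random.default_rng(0)
content_q = rng.normal(size=(4, 8))
pos_q = rng.uniform(size=(4, 3))
feats = rng.normal(size=(50, 8))
coords = rng.uniform(size=(50, 3))

for _ in range(3):  # a few refinement rounds
    content_q, pos_q, attn = position_aware_cross_attention(
        content_q, pos_q, feats, coords, scale=4.0
    )
```

Because every query attends softly to all points, no query starts with an empty receptive field, which is the failure mode a low-recall initial mask can cause under hard masking.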

Cite

Text

Lai et al. "Mask-Attention-Free Transformer for 3D Instance Segmentation." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.00342

Markdown

[Lai et al. "Mask-Attention-Free Transformer for 3D Instance Segmentation." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/lai2023iccv-maskattentionfree/) doi:10.1109/ICCV51070.2023.00342

BibTeX

@inproceedings{lai2023iccv-maskattentionfree,
  title     = {{Mask-Attention-Free Transformer for 3D Instance Segmentation}},
  author    = {Lai, Xin and Yuan, Yuhui and Chu, Ruihang and Chen, Yukang and Hu, Han and Jia, Jiaya},
  booktitle = {International Conference on Computer Vision},
  year      = {2023},
  pages     = {3693-3703},
  doi       = {10.1109/ICCV51070.2023.00342},
  url       = {https://mlanthology.org/iccv/2023/lai2023iccv-maskattentionfree/}
}