SGFormer: Semantic-Geometry Fusion Transformer for Multi-Modal 3D Panoptic Segmentation

Abstract

Modern methods for autonomous driving perception widely adopt multi-modal fusion to enhance 3D scene understanding. However, existing methods suffer from inferior semantic extraction because their image encoders treat all pixels equally, ignoring contextual differences. The resulting multi-modal representations also typically lack the comprehensive semantic and spatial geometry information that is crucial for 3D panoptic segmentation. In this paper, we propose a novel Semantic-Geometry Fusion Transformer (SGFormer) that extracts adaptive semantic contexts, aggregates geometric information, and fuses the two. First, in the Image Branch, we tailor semantic contexts for each pixel with context-guided attention and spatial context alignment to refine semantic details. Second, we transform image and voxel features into point-pixel geometry representations, simultaneously learning semantic category priors as embeddings to better represent scene geometry and semantics. Finally, to aggregate semantic information with the related geometry, we design a semantic-geometry fusion transformer that effectively captures semantic-geometry relationships in multi-modal panoptic representations. Notably, SGFormer achieves state-of-the-art (SOTA) results on nuScenes and SemanticPOSS and yields competitive performance on SemanticKITTI. Moreover, SGFormer exhibits superior robustness compared to leading methods, with improvements of 2% to 10%.
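
The abstract describes fusing per-point geometric features with image-derived semantic features via a transformer. Below is a minimal sketch, not the authors' implementation, of what such a fusion step could look like: a cross-attention block in which geometry features act as queries over semantic pixel features. The module name, feature dimensions, and the use of `nn.MultiheadAttention` are assumptions for illustration only.

```python
# Hypothetical sketch of a semantic-geometry cross-attention fusion block.
# Shapes, layer sizes, and structure are illustrative assumptions, not SGFormer's code.
import torch
import torch.nn as nn


class SemanticGeometryFusion(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        # Geometry features (queries) attend to semantic image features (keys/values).
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, geo_feats, sem_feats):
        # geo_feats: (B, N_points, dim) point/voxel geometry features
        # sem_feats: (B, N_pixels, dim) image semantic features
        fused, _ = self.cross_attn(query=geo_feats, key=sem_feats, value=sem_feats)
        x = self.norm1(geo_feats + fused)   # residual connection + normalization
        x = self.norm2(x + self.ffn(x))     # feed-forward refinement
        return x                            # (B, N_points, dim) fused representation


if __name__ == "__main__":
    fusion = SemanticGeometryFusion(dim=256, num_heads=8)
    geo = torch.randn(2, 1024, 256)   # e.g. 1024 points per sample
    sem = torch.randn(2, 4096, 256)   # e.g. a 64x64 image feature map, flattened
    print(fusion(geo, sem).shape)     # torch.Size([2, 1024, 256])
```

In this sketch, the fused output would feed a panoptic segmentation head; how SGFormer actually structures its fusion, category-prior embeddings, and heads is detailed in the paper itself.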

Cite

Text

Yu et al. "SGFormer: Semantic-Geometry Fusion Transformer for Multi-Modal 3D Panoptic Segmentation." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I9.33042

Markdown

[Yu et al. "SGFormer: Semantic-Geometry Fusion Transformer for Multi-Modal 3D Panoptic Segmentation." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/yu2025aaai-sgformer/) doi:10.1609/AAAI.V39I9.33042

BibTeX

@inproceedings{yu2025aaai-sgformer,
  title     = {{SGFormer: Semantic-Geometry Fusion Transformer for Multi-Modal 3D Panoptic Segmentation}},
  author    = {Yu, Hongqi and Chan, Sixian and Zhou, Xiaolong and Zhang, Xiaoqin},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {9616--9625},
  doi       = {10.1609/AAAI.V39I9.33042},
  url       = {https://mlanthology.org/aaai/2025/yu2025aaai-sgformer/}
}