GeoFormer: Geometry Point Encoder for 3D Object Detection with Graph-Based Transformer

Abstract

Lidar-based 3D detection is one of the most popular research fields in autonomous driving. 3D detectors typically detect specific targets in a scene according to the pattern formed by the spatial distribution of point clouds. However, existing voxel-based methods usually adopt MLP and global pooling (e.g., PointNet, CenterPoint) as voxel feature encoder, which makes it less effective to extract detailed spatial structure information from raw points, leading to information loss and inferior performance. In this paper, we propose a novel graph-based transformer to encode voxel features by condensing the full and detailed point's geometry, termed as GeoFormer. We first represent points within a voxel as a graph, based on relative distances to capture its spatial geometry. Then, We introduce a geometry-guided transformer architecture to encode voxel features, where the adjacent geometric clues are used to re-weight point feature similarities, enabling more effective extraction of geometric relationships between point pairs at varying distances. We highlight that GeoFormer is a plug-and-play module which can be seamlessly integrated to enhance the performance of existing voxel-based detectors. Extensive experiments conducted on three popular outdoor datasets demonstrate that our GeoFormer achieves the start-of-the-art performance on both effectiveness and robustness comparisons.

Cite

Text

Jin et al. "GeoFormer: Geometry Point Encoder for 3D Object Detection with Graph-Based Transformer." International Conference on Computer Vision, 2025.

Markdown

[Jin et al. "GeoFormer: Geometry Point Encoder for 3D Object Detection with Graph-Based Transformer." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/jin2025iccv-geoformer/)

BibTeX

@inproceedings{jin2025iccv-geoformer,
  title     = {{GeoFormer: Geometry Point Encoder for 3D Object Detection with Graph-Based Transformer}},
  author    = {Jin, Xin and Su, Haisheng and Ma, Cong and Liu, Kai and Wu, Wei and Hui, Fei and Yan, Junchi},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {26879-26889},
  url       = {https://mlanthology.org/iccv/2025/jin2025iccv-geoformer/}
}