SAM4D: Segment Anything in Camera and LiDAR Streams

Abstract

We present SAM4D, a multi-modal and temporal foundation model designed for promptable segmentation across camera and LiDAR streams. We introduce Unified Multi-modal Positional Encoding (UMPE) to align camera and LiDAR features in a shared 3D space, enabling seamless cross-modal prompting and interaction. Additionally, we propose Motion-aware Cross-modal Memory Attention (MCMA), which leverages ego-motion compensation to enhance temporal consistency and long-horizon feature retrieval, ensuring robust segmentation across dynamically changing autonomous driving scenes. To avoid annotation bottlenecks, we develop a multi-modal automated data engine that synergizes VFM-driven video masklets, spatiotemporal 4D reconstruction, and cross-modal masklet fusion. This framework generates camera-LiDAR aligned pseudo-labels orders of magnitude faster than human annotation while preserving VFM-derived semantic fidelity in point cloud representations. Extensive experiments on the constructed Waymo-4DSeg dataset demonstrate the powerful cross-modal segmentation capability of SAM4D and its great potential for data annotation.
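
The abstract describes UMPE as aligning camera and LiDAR features in a shared 3D space. As a rough illustration of that idea only (not the paper's actual formulation), the sketch below lifts camera pixels into the ego frame using per-pixel depth and assumed intrinsics/extrinsics, then applies the same sinusoidal 3D positional encoding to the lifted pixels and to raw LiDAR points, so both modalities receive position embeddings in one coordinate frame. All function names, shapes, and the specific sinusoidal form are illustrative assumptions.

# Hypothetical sketch of a shared 3D positional encoding for camera and LiDAR
# tokens; values and helper names are assumptions, not SAM4D's implementation.
import numpy as np

def sinusoidal_3d_encoding(xyz: np.ndarray, num_freqs: int = 8) -> np.ndarray:
    """xyz: (N, 3) points in the ego frame -> (N, 3 * 2 * num_freqs) embedding."""
    freqs = 2.0 ** np.arange(num_freqs)              # geometric frequency ladder
    scaled = xyz[:, :, None] * freqs[None, None, :]  # (N, 3, num_freqs)
    enc = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)
    return enc.reshape(xyz.shape[0], -1)

def lift_pixels_to_ego(uv: np.ndarray, depth: np.ndarray,
                       K: np.ndarray, cam_to_ego: np.ndarray) -> np.ndarray:
    """Back-project pixels (N, 2) with per-pixel depth (N,) into the ego frame."""
    ones = np.ones((uv.shape[0], 1))
    rays = (np.linalg.inv(K) @ np.hstack([uv, ones]).T).T   # camera-frame rays
    pts_cam = rays * depth[:, None]                         # scale rays by depth
    pts_hom = np.hstack([pts_cam, ones])
    return (cam_to_ego @ pts_hom.T).T[:, :3]                # apply 4x4 extrinsic

# Usage: camera tokens (after depth lifting) and LiDAR points share one encoding.
K = np.array([[1000., 0., 640.], [0., 1000., 360.], [0., 0., 1.]])  # assumed intrinsics
cam_to_ego = np.eye(4)                                              # assumed extrinsic
uv = np.array([[320., 180.], [960., 540.]])
depth = np.array([12.0, 25.0])
cam_xyz = lift_pixels_to_ego(uv, depth, K, cam_to_ego)
lidar_xyz = np.array([[10.0, 2.0, -1.5], [30.0, -4.0, 0.2]])
cam_pe = sinusoidal_3d_encoding(cam_xyz)     # (2, 48)
lidar_pe = sinusoidal_3d_encoding(lidar_xyz) # (2, 48), same embedding space

Because both embeddings live in the same ego-frame coordinate system, a prompt anchored in one modality can, in principle, attend to features of the other; the paper's MCMA additionally compensates memory coordinates for ego motion before such cross-frame retrieval.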

Cite

Text

Xu et al. "SAM4D: Segment Anything in Camera and LiDAR Streams." International Conference on Computer Vision, 2025.

Markdown

[Xu et al. "SAM4D: Segment Anything in Camera and LiDAR Streams." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/xu2025iccv-sam4d/)

BibTeX

@inproceedings{xu2025iccv-sam4d,
  title     = {{SAM4D: Segment Anything in Camera and LiDAR Streams}},
  author    = {Xu, Jianyun and Wang, Song and Ni, Ziqian and Hu, Chunyong and Yang, Sheng and Zhu, Jianke and Li, Qiang},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {28535--28545},
  url       = {https://mlanthology.org/iccv/2025/xu2025iccv-sam4d/}
}