SpatialTrackerV2: Advancing 3D Point Tracking with Explicit Camera Motion

Abstract

We present SpatialTrackerV2, a feed-forward 3D point tracking method for monocular videos. Going beyond modular pipelines built on off-the-shelf components for 3D tracking, our approach unifies the intrinsic connections between point tracking, monocular depth, and camera pose estimation into a single high-performing, feed-forward 3D point tracker. It decomposes world-space 3D motion into scene geometry, camera ego-motion, and pixel-wise object motion, and its fully differentiable, end-to-end architecture allows scalable training across a wide range of datasets, including synthetic sequences, posed RGB-D videos, and unlabeled in-the-wild footage. By learning geometry and motion jointly from such heterogeneous data, SpatialTrackerV2 outperforms existing 3D tracking methods by 30% and matches the accuracy of leading dynamic 3D reconstruction approaches while running 50x faster.
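As a rough illustration of this decomposition (a minimal sketch, not the paper's implementation; the pinhole-camera model, the pose parameterization, and all names below are our own assumptions), a tracked pixel's world-space position can be assembled from its depth (scene geometry), the camera pose (ego-motion), and a residual per-point displacement (object motion):

```python
import numpy as np

def world_position(uv, depth, K, cam_to_world, object_motion):
    """Illustrative decomposition of world-space 3D motion into
    scene geometry, camera ego-motion, and pixel-wise object motion."""
    u, v = uv
    # Scene geometry: unproject the pixel using its depth and intrinsics K.
    point_cam = depth * (np.linalg.inv(K) @ np.array([u, v, 1.0]))
    # Camera ego-motion: map the camera-space point into world space.
    R, t = cam_to_world
    point_world_static = R @ point_cam + t
    # Object motion: add the residual per-point displacement.
    return point_world_static + object_motion

# Example: a pixel at (320, 240) with 2 m depth, identity pose, small object shift.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
p = world_position((320, 240), 2.0, K, (np.eye(3), np.zeros(3)), np.array([0.1, 0.0, 0.0]))
```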

Cite

Text

Xiao et al. "SpatialTrackerV2: Advancing 3D Point Tracking with Explicit Camera Motion." International Conference on Computer Vision, 2025.

Markdown

[Xiao et al. "SpatialTrackerV2: Advancing 3D Point Tracking with Explicit Camera Motion." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/xiao2025iccv-spatialtrackerv2/)

BibTeX

@inproceedings{xiao2025iccv-spatialtrackerv2,
  title     = {{SpatialTrackerV2: Advancing 3D Point Tracking with Explicit Camera Motion}},
  author    = {Xiao, Yuxi and Wang, Jianyuan and Xue, Nan and Karaev, Nikita and Makarov, Yuri and Kang, Bingyi and Zhu, Xing and Bao, Hujun and Shen, Yujun and Zhou, Xiaowei},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {6726--6737},
  url       = {https://mlanthology.org/iccv/2025/xiao2025iccv-spatialtrackerv2/}
}