TAPNext: Tracking Any Point (TAP) as Next Token Prediction

Abstract

Tracking Any Point (TAP) in a video is a challenging computer vision problem with many demonstrated applications in robotics, video editing, and 3D reconstruction. Existing methods for TAP rely heavily on complex tracking-specific inductive biases and heuristics, limiting their generality and potential for scaling. To address these challenges, we present TAPNext, a new approach that casts TAP as sequential masked token decoding. Our model is causal, tracks in a purely online fashion, and removes tracking-specific inductive biases. This enables TAPNext to run with minimal latency and removes the temporal windowing required by many existing state-of-the-art trackers. Despite its simplicity, TAPNext achieves state-of-the-art tracking performance among both online and offline trackers. Finally, we present evidence that many widely used tracking heuristics emerge naturally in TAPNext through end-to-end training.
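
The abstract's central idea, point tracking as online, causal token decoding, can be sketched concretely. The snippet below is an illustrative Python mock-up rather than the authors' implementation: the coordinate binning scheme, the fake_model stand-in, and every name in it are assumptions made for illustration; only the overall loop shape (one frame at a time, with per-frame decoding of masked point tokens and no access to future frames) mirrors the formulation described above.

# Illustrative sketch, NOT the TAPNext code: point coordinates are
# discretized into tokens, and tracking becomes predicting the next
# coordinate token per frame, analogous to next-token prediction in
# language models. NUM_BINS, MASK_TOKEN, quantize, and fake_model are
# all hypothetical stand-ins.

import numpy as np

NUM_BINS = 256          # coordinate quantization: one token per pixel bin
MASK_TOKEN = NUM_BINS   # unknown future positions start out masked

def quantize(coord, size=256):
    """Map a continuous coordinate in [0, size) to a discrete token id."""
    return int(np.clip(coord, 0, size - 1))

def fake_model(frame, state, track_tokens):
    """Stand-in for the network: returns per-point logits over coordinate
    bins plus a recurrent state carried between frames (a no-op here)."""
    logits = np.random.randn(len(track_tokens), 2, NUM_BINS)
    return logits, state

# Online, causal decoding: one frame at a time, never looking ahead,
# so no temporal window over future frames is required.
query = [(64.0, 64.0)]                          # (x, y) query point, frame 0
tokens = [(quantize(x), quantize(y)) for x, y in query]
state = None
for frame in np.zeros((8, 256, 256, 3)):        # dummy 8-frame video
    logits, state = fake_model(frame, state, tokens)
    # Each masked position token is greedily decoded to its most likely
    # coordinate bin, one frame at a time.
    tokens = [tuple(int(np.argmax(l)) for l in point) for point in logits]
    print("decoded (x, y) tokens:", tokens)

Because each step conditions only on past frames, such a decoder can emit predictions as frames arrive, which is what permits the minimal latency and the removal of temporal windowing noted in the abstract.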

Cite

Text

Zholus et al. "TAPNext: Tracking Any Point (TAP) as Next Token Prediction." International Conference on Computer Vision, 2025.

Markdown

[Zholus et al. "TAPNext: Tracking Any Point (TAP) as Next Token Prediction." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/zholus2025iccv-tapnext/)

BibTeX

@inproceedings{zholus2025iccv-tapnext,
  title     = {{TAPNext: Tracking Any Point (TAP) as Next Token Prediction}},
  author    = {Zholus, Artem and Doersch, Carl and Yang, Yi and Koppula, Skanda and Patraucean, Viorica and He, Xu Owen and Rocco, Ignacio and Sajjadi, Mehdi S. M. and Chandar, Sarath and Goroshin, Ross},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {9693--9703},
  url       = {https://mlanthology.org/iccv/2025/zholus2025iccv-tapnext/}
}