Video Diffusion Models Excel at Tracking Similar-Looking Objects Without Supervision

Abstract

Distinguishing visually similar objects by their motion remains a critical challenge in computer vision. Although supervised trackers show promise, contemporary self-supervised trackers struggle when visual cues become ambiguous, limiting their scalability and generalization without extensive labeled data. We find that pre-trained video diffusion models inherently learn motion representations suitable for tracking without task-specific training. This ability arises because their denoising process isolates motion in early, high-noise stages, distinct from later appearance refinement. Capitalizing on this discovery, our self-supervised tracker significantly improves performance in distinguishing visually similar objects, an underexplored failure point for existing methods. Our method achieves up to a 6-point improvement over recent self-supervised approaches on established benchmarks and our newly introduced tests focused on tracking visually similar items. Visualizations confirm that these diffusion-derived motion representations enable robust tracking of even identical objects across challenging viewpoint changes and deformations. Project page: https://chenshuang-zhang.github.io/projects/ted
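
The abstract does not say how features are read out of the diffusion model or how they are matched across frames, so the following is only a rough sketch under stated assumptions, not the authors' implementation. It assumes (1) frames are perturbed with the standard DDPM forward process at a high-noise step (small cumulative `alpha_bar_t`), (2) per-location features are then taken from some intermediate layer of a pre-trained video denoiser, which is not modeled here and is replaced by placeholder arrays, and (3) labels are propagated between frames by cosine-similarity matching, a protocol often used to evaluate self-supervised trackers. The function names `add_noise` and `propagate_labels` and the `temperature` parameter are hypothetical.

```python
import numpy as np


def add_noise(x0, alpha_bar_t, rng):
    """Standard DDPM forward process: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps.

    A small alpha_bar_t corresponds to an early, high-noise denoising stage.
    """
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps


def propagate_labels(feat_ref, labels_ref, feat_tgt, temperature=0.1):
    """Propagate labels from a reference frame to a target frame.

    feat_ref:   (N, D) features of N reference locations (assumed to come
                from a denoiser layer; placeholders in this sketch).
    labels_ref: (N, K) one-hot object labels for the reference locations.
    feat_tgt:   (M, D) features of M target locations.
    Returns an (M,) array with the predicted object index per target location.
    """
    # Cosine similarity between every target and reference location.
    ref = feat_ref / np.linalg.norm(feat_ref, axis=1, keepdims=True)
    tgt = feat_tgt / np.linalg.norm(feat_tgt, axis=1, keepdims=True)
    logits = tgt @ ref.T / temperature

    # Row-wise softmax turns similarities into soft correspondences.
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Each target location takes the label with the highest aggregated weight.
    return (weights @ labels_ref).argmax(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.standard_normal((4, 8, 8))  # toy stand-in for a video frame
    noisy = add_noise(frame, alpha_bar_t=0.05, rng=rng)  # high-noise input
    print(noisy.shape)
```

In this sketch the two objects in a frame could look identical; the matching depends only on whatever the features encode, which is why features that capture motion rather than appearance would matter for telling such objects apart.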

Cite

Text

Zhang et al. "Video Diffusion Models Excel at Tracking Similar-Looking Objects Without Supervision." Advances in Neural Information Processing Systems, 2025.

Markdown

[Zhang et al. "Video Diffusion Models Excel at Tracking Similar-Looking Objects Without Supervision." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/zhang2025neurips-video/)

BibTeX

@inproceedings{zhang2025neurips-video,
  title     = {{Video Diffusion Models Excel at Tracking Similar-Looking Objects Without Supervision}},
  author    = {Zhang, Chenshuang and Zhang, Kang and Chung, Joon Son and Kweon, In So and Kim, Junmo and Mao, Chengzhi},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/zhang2025neurips-video/}
}