Segment Any Motion in Videos
Abstract
Moving object segmentation is a crucial task for achieving a high-level understanding of visual scenes and has numerous downstream applications. Humans can effortlessly segment moving objects in videos. Previous work has largely relied on optical flow to provide motion cues; however, this approach often results in imperfect predictions due to challenges such as partial motion, complex deformations, motion blur, and background distractions. We propose a novel approach for moving object segmentation that combines long-range trajectory motion cues with DINO-based semantic features and leverages SAM2 for pixel-level mask densification through an iterative prompting strategy. Our model employs Spatio-Temporal Trajectory Attention and Motion-Semantic Decoupled Embedding to prioritize motion while integrating semantic support. Extensive testing on diverse datasets demonstrates state-of-the-art performance, excelling in challenging scenarios and fine-grained segmentation of multiple objects. Our code is available at https://motion-seg.github.io/.
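To make the two named components concrete, below is a minimal PyTorch sketch of what "Spatio-Temporal Trajectory Attention" and "Motion-Semantic Decoupled Embedding" could look like. Only the component names come from the abstract; all tensor shapes, layer sizes, and the gating scheme are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class SpatioTemporalTrajectoryAttention(nn.Module):
    # Attends along each trajectory over time, then across trajectories
    # within each frame (a common factorized space-time attention pattern;
    # the paper's exact factorization is an assumption here).
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (N, T, D) -- N trajectories, T frames, D-dim features.
        h = self.norm1(x)
        x = x + self.temporal(h, h, h, need_weights=False)[0]            # along time
        h = self.norm2(x).transpose(0, 1)                                # (T, N, D)
        x = x + self.spatial(h, h, h, need_weights=False)[0].transpose(0, 1)
        return x

class MotionSemanticDecoupledEmbedding(nn.Module):
    # Keeps motion and semantic (e.g. DINO) features in separate streams;
    # a learned gate decides how much semantic support to mix into the
    # motion stream, so motion remains the primary cue.
    def __init__(self, motion_dim=128, sem_dim=384, dim=256):
        super().__init__()
        self.motion_proj = nn.Linear(motion_dim, dim)
        self.sem_proj = nn.Linear(sem_dim, dim)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, motion, semantic):
        m = self.motion_proj(motion)
        s = self.sem_proj(semantic)
        g = self.gate(torch.cat([m, s], dim=-1))
        return m + g * s          # motion is always kept; semantics are gated in

# Toy usage: 500 tracked points over 24 frames, with 384-d DINO features.
# Both inputs are random stand-ins for real trajectory and DINO extractors.
motion = torch.randn(500, 24, 128)
semantic = torch.randn(500, 24, 384)
feats = SpatioTemporalTrajectoryAttention()(MotionSemanticDecoupledEmbedding()(motion, semantic))
logits = nn.Linear(256, 1)(feats.mean(dim=1))   # per-trajectory dynamic/static score

Trajectories scored as dynamic would then serve as point prompts for SAM2, whose iterative prompting densifies the sparse points into pixel-level masks; that prompting loop and the real trajectory/DINO feature extraction are omitted from this sketch.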
Cite
Text
Huang et al. "Segment Any Motion in Videos." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.00323
Markdown
[Huang et al. "Segment Any Motion in Videos." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/huang2025cvpr-segment/) doi:10.1109/CVPR52734.2025.00323
BibTeX
@inproceedings{huang2025cvpr-segment,
title = {{Segment Any Motion in Videos}},
author = {Huang, Nan and Zheng, Wenzhao and Xu, Chenfeng and Keutzer, Kurt and Zhang, Shanghang and Kanazawa, Angjoo and Wang, Qianqian},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2025},
pages = {3406--3416},
doi = {10.1109/CVPR52734.2025.00323},
url = {https://mlanthology.org/cvpr/2025/huang2025cvpr-segment/}
}