RoMo: Robust Motion Segmentation Improves Structure from Motion

Abstract

There has been extensive progress in the reconstruction and generation of 4D scenes from casually captured monocular video. Estimating accurate camera poses from such videos through structure-from-motion (SfM) relies on robustly separating the static and dynamic parts of a scene. We propose a novel approach to video-based motion segmentation that identifies the components of a scene moving with respect to a fixed world frame. Our simple but effective iterative method, RoMo, combines optical flow and epipolar cues with a pre-trained video segmentation model. It outperforms unsupervised baselines for motion segmentation as well as supervised baselines trained on synthetic data. More importantly, combining an off-the-shelf SfM pipeline with our segmentation masks establishes a new state of the art in camera calibration for scenes with dynamic content, outperforming existing methods by a substantial margin.
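
As a rough illustration of the epipolar cue mentioned in the abstract, the sketch below flags flow correspondences that violate the epipolar geometry of the dominant (camera) motion, which is one common way to obtain a per-pixel dynamicness signal. This is not the paper's implementation: the function name epipolar_motion_cue, the OpenCV RANSAC fundamental-matrix fit, and the pixel threshold are assumptions made for illustration only; RoMo additionally iterates such cues with a pre-trained video segmentation model to produce clean motion masks.

import cv2
import numpy as np

def epipolar_motion_cue(pts0, pts1, threshold_px=2.0):
    # pts0, pts1: (N, 2) float32 arrays of matched pixel coordinates in two
    # frames, e.g. sampled from a dense optical-flow field.
    # Returns a boolean array, True where a match violates the epipolar
    # constraint of the dominant camera motion, i.e. a likely dynamic point.
    F, _ = cv2.findFundamentalMat(pts0, pts1, cv2.FM_RANSAC, 1.0, 0.999)
    p0 = np.hstack([pts0, np.ones((len(pts0), 1))])  # homogeneous coordinates
    p1 = np.hstack([pts1, np.ones((len(pts1), 1))])
    Fp0 = p0 @ F.T    # epipolar lines of pts0 in the second image
    Ftp1 = p1 @ F     # epipolar lines of pts1 in the first image
    # Sampson distance: first-order geometric error of the constraint p1^T F p0 = 0.
    num = np.sum(p1 * Fp0, axis=1) ** 2
    den = Fp0[:, 0] ** 2 + Fp0[:, 1] ** 2 + Ftp1[:, 0] ** 2 + Ftp1[:, 1] ** 2
    sampson_sq = num / den
    return sampson_sq > threshold_px ** 2

Points whose Sampson error exceeds the threshold are inconsistent with a single rigid camera motion and can be treated as candidate dynamic regions, which an SfM pipeline can then ignore when estimating poses.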

Cite

Text

Goli et al. "RoMo: Robust Motion Segmentation Improves Structure from Motion." International Conference on Computer Vision, 2025.

Markdown

[Goli et al. "RoMo: Robust Motion Segmentation Improves Structure from Motion." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/goli2025iccv-romo/)

BibTeX

@inproceedings{goli2025iccv-romo,
  title     = {{RoMo: Robust Motion Segmentation Improves Structure from Motion}},
  author    = {Goli, Lily and Sabour, Sara and Matthews, Mark and Brubaker, Marcus A. and Lagun, Dmitry and Jacobson, Alec and Fleet, David J. and Saxena, Saurabh and Tagliasacchi, Andrea},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {6155--6164},
  url       = {https://mlanthology.org/iccv/2025/goli2025iccv-romo/}
}