XVO: Generalized Visual Odometry via Cross-Modal Self-Training
Abstract
We propose XVO, a semi-supervised learning method for training generalized monocular Visual Odometry (VO) models with robust off-the-shelf operation across diverse datasets and settings. In contrast to standard monocular VO approaches, which often study a known calibration within a single dataset, XVO efficiently learns to recover relative pose with real-world scale from visual scene semantics, i.e., without relying on any known camera parameters. We optimize the motion estimation model via self-training from large amounts of unconstrained and heterogeneous dash camera videos available on YouTube. Our key contribution is twofold. First, we empirically demonstrate the benefits of semi-supervised training for learning a general-purpose direct VO regression network. Second, we demonstrate multi-modal supervision, including segmentation, flow, depth, and audio auxiliary prediction tasks, to facilitate generalized representations for the VO task. Specifically, we find the audio prediction task to significantly enhance the semi-supervised learning process while alleviating noisy pseudo-labels, particularly in highly dynamic and out-of-domain video data. Our proposed teacher network achieves state-of-the-art performance on the commonly used KITTI benchmark despite no multi-frame optimization or knowledge of camera parameters. Combined with the proposed semi-supervised step, XVO demonstrates off-the-shelf knowledge transfer across diverse conditions on KITTI, nuScenes, and Argoverse without fine-tuning.
Cite
Text
Lai et al. "XVO: Generalized Visual Odometry via Cross-Modal Self-Training." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.00926
Markdown
[Lai et al. "XVO: Generalized Visual Odometry via Cross-Modal Self-Training." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/lai2023iccv-xvo/) doi:10.1109/ICCV51070.2023.00926
BibTeX
@inproceedings{lai2023iccv-xvo,
title = {{XVO: Generalized Visual Odometry via Cross-Modal Self-Training}},
author = {Lai, Lei and Shangguan, Zhongkai and Zhang, Jimuyang and Ohn-Bar, Eshed},
booktitle = {International Conference on Computer Vision},
year = {2023},
pages = {10094--10105},
doi = {10.1109/ICCV51070.2023.00926},
url = {https://mlanthology.org/iccv/2023/lai2023iccv-xvo/}
}