Viewpoint Rosetta Stone: Unlocking Unpaired Ego-Exo Videos for View-Invariant Representation Learning

Abstract

Egocentric and exocentric perspectives of human action differ significantly, yet overcoming this extreme viewpoint gap is critical for applications in augmented reality and robotics. We propose ViewpointRosetta, an approach that unlocks large-scale unpaired ego and exo video data to learn clip-level viewpoint-invariant video representations. Our framework introduces (1) a diffusion-based Rosetta Stone Translator (RST), which, leveraging a moderate amount of synchronized multi-view videos, serves as a translator in feature space to decipher the alignments between unpaired ego and exo data, and (2) a dual encoder that aligns unpaired data representations through contrastive learning with RST-based synthetic feature augmentation and soft alignment. To evaluate the learned features in a standardized setting, we construct a new cross-view benchmark using Ego-Exo4D, covering cross-view retrieval, action recognition, and skill assessment. Our framework demonstrates superior cross-view understanding compared to previous view-invariant learning and egocentric video representation learning approaches, and opens the door to bringing vast amounts of traditional third-person video to bear on the more nascent first-person setting.
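
As a rough illustration of the second component, the sketch below shows one generic way to write a symmetric contrastive loss with soft alignment targets between ego and exo clip features. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `affinity` matrix (a stand-in for whatever soft correspondences a translator such as the RST might supply), and the temperature value are all hypothetical and do not come from the paper.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def row_normalize(weights, eps=1e-8):
    """Turn non-negative weights into row-wise probability distributions."""
    return weights / (weights.sum(axis=1, keepdims=True) + eps)

def soft_contrastive_loss(ego, exo, affinity, temperature=0.07):
    """Symmetric cross-entropy between soft targets and similarity distributions.

    ego, exo : (B, D) arrays of clip features from the two encoders.
    affinity : (B, B) non-negative array; affinity[i, j] is a hypothetical
               soft score that ego clip i corresponds to exo clip j.
               An identity matrix recovers the usual hard-target InfoNCE loss.
    """
    ego, exo = l2_normalize(ego), l2_normalize(exo)
    logits = ego @ exo.T / temperature            # (B, B) scaled cosine similarities
    targets_e2x = row_normalize(affinity)         # ego -> exo soft targets
    targets_x2e = row_normalize(affinity.T)       # exo -> ego soft targets
    loss_e2x = -(targets_e2x * log_softmax(logits)).sum(axis=1).mean()
    loss_x2e = -(targets_x2e * log_softmax(logits.T)).sum(axis=1).mean()
    return 0.5 * (loss_e2x + loss_x2e)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ego = rng.normal(size=(8, 16))
    exo = rng.normal(size=(8, 16))
    # Assumed soft targets: mostly diagonal, with a little mass spread elsewhere.
    affinity = 0.9 * np.eye(8) + 0.1 / 8
    print(soft_contrastive_loss(ego, exo, affinity))
```

Normalizing the affinity separately for each direction keeps both target distributions valid even when the soft scores are not symmetric, which is one simple design choice among several.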

Cite

Text

Luo et al. "Viewpoint Rosetta Stone: Unlocking Unpaired Ego-Exo Videos for View-Invariant Representation Learning." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.01473

Markdown

[Luo et al. "Viewpoint Rosetta Stone: Unlocking Unpaired Ego-Exo Videos for View-Invariant Representation Learning." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/luo2025cvpr-viewpoint/) doi:10.1109/CVPR52734.2025.01473

BibTeX

@inproceedings{luo2025cvpr-viewpoint,
  title     = {{Viewpoint Rosetta Stone: Unlocking Unpaired Ego-Exo Videos for View-Invariant Representation Learning}},
  author    = {Luo, Mi and Xue, Zihui and Dimakis, Alex and Grauman, Kristen},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {15802--15812},
  doi       = {10.1109/CVPR52734.2025.01473},
  url       = {https://mlanthology.org/cvpr/2025/luo2025cvpr-viewpoint/}
}