MV2MAE: Self-Supervised Video Pre-Training with Motion-Aware Multi-View Masked Autoencoders

Shah, Ketul; Crandall, Robert; Xu, Jie; Zhou, Peng; Pillai, Vipin; George, Marian; Bansal, Mayank; Chellappa, Rama

TMLR 2026

Abstract

Videos captured from multiple viewpoints can help in perceiving the 3D structure of the world and benefit computer vision tasks such as action recognition and tracking. In this paper, we present MV2MAE, a method for self-supervised learning from synchronized multi-view videos, built on the masked autoencoder framework. We introduce two key enhancements to better exploit multi-view video data. First, we design a cross-view reconstruction task that uses a cross-attention-based decoder to reconstruct a target-viewpoint video from a source view. This injects geometric information and yields representations that are robust to viewpoint changes. Second, we introduce a controllable motion-weighted reconstruction loss that emphasizes dynamic regions and mitigates trivial reconstruction of static backgrounds. This improves temporal modeling and encourages learning more meaningful representations across views. MV2MAE achieves state-of-the-art results on the NTU-60, NTU-120, and ETRI datasets among self-supervised approaches. In the more practical transfer learning setting, it delivers consistent gains of +2.0% to +8.5% on the NUCLA, PKU-MMD-II, and ROCOG-v2 datasets, demonstrating the robustness and generalizability of our approach. Code: https://github.com/kshah33/mv2mae
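
To make the idea of a motion-weighted reconstruction loss concrete, the following is a minimal illustrative sketch in plain Python, not the authors' implementation. The abstract does not give the formula, so everything here is an assumption: motion is approximated by the mean absolute difference between consecutive frames of the target video, the weights are a softmax over patches, and a hypothetical `temperature` parameter stands in for the "controllable" knob (a large value recovers an unweighted masked MSE). The function names `patch_motion`, `motion_weights`, and `motion_weighted_loss` are invented for this example.

```python
import math


def patch_motion(frames):
    """Per-patch motion proxy (assumption): mean absolute frame difference.

    frames: list of T frames; each frame is a list of P patches;
    each patch is a list of pixel values (floats).
    """
    num_frames = len(frames)
    motion = []
    for t in range(num_frames):
        # Compare with the previous frame; the first frame uses the next one.
        ref = frames[t - 1] if t > 0 else frames[min(1, num_frames - 1)]
        motion.append([
            sum(abs(a - b) for a, b in zip(cur, prev)) / len(cur)
            for cur, prev in zip(frames[t], ref)
        ])
    return motion


def motion_weights(motion, temperature=1.0):
    """Softmax over all patches, rescaled so the weights average to 1.

    temperature is a hypothetical control: large values flatten the
    weights toward uniform, small values concentrate them on motion.
    """
    flat = [m / temperature for row in motion for m in row]
    peak = max(flat)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in flat]
    total = sum(exps)
    scaled = [len(flat) * e / total for e in exps]
    patches_per_frame = len(motion[0])
    return [
        scaled[t * patches_per_frame:(t + 1) * patches_per_frame]
        for t in range(len(motion))
    ]


def motion_weighted_loss(pred, target, mask, temperature=1.0):
    """Weighted mean of per-patch MSE over masked patches only."""
    weights = motion_weights(patch_motion(target), temperature)
    weighted_error = 0.0
    weight_sum = 0.0
    for t in range(len(target)):
        for p in range(len(target[t])):
            if not mask[t][p]:
                continue  # visible patches do not contribute to the loss
            err = sum(
                (x - y) ** 2 for x, y in zip(pred[t][p], target[t][p])
            ) / len(target[t][p])
            weighted_error += weights[t][p] * err
            weight_sum += weights[t][p]
    return weighted_error / weight_sum if weight_sum > 0 else 0.0
```

Under these assumptions, an identical pixel error costs more on a patch that changes between frames than on a static background patch, which is the behavior the abstract describes. The released code at the URL above is the reference for the actual weighting scheme.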

Cite

Text

Shah et al. "MV2MAE: Self-Supervised Video Pre-Training with Motion-Aware Multi-View Masked Autoencoders." Transactions on Machine Learning Research, 2026.

Markdown

[Shah et al. "MV2MAE: Self-Supervised Video Pre-Training with Motion-Aware Multi-View Masked Autoencoders." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/shah2026tmlr-mv2mae/)

BibTeX

@article{shah2026tmlr-mv2mae,
  title     = {{MV2MAE: Self-Supervised Video Pre-Training with Motion-Aware Multi-View Masked Autoencoders}},
  author    = {Shah, Ketul and Crandall, Robert and Xu, Jie and Zhou, Peng and Pillai, Vipin and George, Marian and Bansal, Mayank and Chellappa, Rama},
  journal   = {Transactions on Machine Learning Research},
  year      = {2026},
  url       = {https://mlanthology.org/tmlr/2026/shah2026tmlr-mv2mae/}
}