VideoVAE+: Large Motion Video Autoencoding with Cross-Modal Video VAE

Xing, Yazhou; Fei, Yang; He, Yingqing; Chen, Jingye; Xie, Jiaxin; Chi, Xiaowei; Chen, Qifeng

ICCV 2025 pp. 17951-17960

Abstract

Learning a robust video Variational Autoencoder (VAE) is essential for reducing video redundancy and facilitating efficient video generation. Directly applying image VAEs to individual frames in isolation results in temporal inconsistencies and fails to compress temporal redundancy effectively. Existing works on Video VAEs compress temporal redundancy but struggle to handle videos with large motion effectively. They suffer from issues such as severe image blur and loss of detail in scenarios with large motion. In this paper, we present a powerful video VAE named VideoVAE+ that effectively reconstructs videos with large motion. First, we investigate two architecture choices and propose our simple yet effective architecture with better spatiotemporal joint modeling performance. Second, we propose to leverage the textual information in existing text-to-video datasets and incorporate text guidance during training. The textual guidance is optional during inference. We find that this design enhances the reconstruction quality and preservation of detail. Finally, our models achieve strong performance compared with various baseline approaches in both general videos and large motion videos, demonstrating their effectiveness on the challenging large motion scenarios.
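The abstract does not say how text guidance enters the model or how it is made optional. As a rough illustration only, the sketch below assumes a residual cross-attention step from video latent tokens to text embeddings, with the step skipped when no caption is supplied. The function name `fuse_text`, the token shapes, and the absence of learned projections are all hypothetical simplifications, not the authors' architecture.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_text(video_tokens, text_tokens=None):
    """Hypothetical residual cross-attention from video tokens to text tokens.

    video_tokens: list of N vectors of length D (video latent tokens).
    text_tokens:  list of M vectors of length D, or None when no caption
                  is available, in which case the video tokens pass through.
    Assumption: no learned query/key/value projections, for brevity.
    """
    if text_tokens is None:
        # Text is optional: return a copy of the video tokens unchanged.
        return [list(v) for v in video_tokens]

    dim = len(video_tokens[0])
    fused = []
    for v in video_tokens:
        # Scaled dot-product score of this video token against each text token.
        scores = [
            sum(a * b for a, b in zip(v, t)) / math.sqrt(dim)
            for t in text_tokens
        ]
        weights = softmax(scores)
        # Attention-weighted average of the text tokens.
        context = [
            sum(w * t[k] for w, t in zip(weights, text_tokens))
            for k in range(dim)
        ]
        # Residual connection keeps the video pathway intact.
        fused.append([a + c for a, c in zip(v, context)])
    return fused


video = [[1.0, 0.0], [0.0, 1.0]]
caption = [[1.0, 1.0]]
without_text = fuse_text(video)
with_text = fuse_text(video, caption)
```

Because the text branch is additive, the same function serves both the captioned and caption-free cases, which is one simple way a model could keep text guidance optional at inference.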

Cite

Text

Xing et al. "VideoVAE+: Large Motion Video Autoencoding with Cross-Modal Video VAE." International Conference on Computer Vision, 2025.

Markdown

[Xing et al. "VideoVAE+: Large Motion Video Autoencoding with Cross-Modal Video VAE." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/xing2025iccv-videovae/)

BibTeX

@inproceedings{xing2025iccv-videovae,
  title     = {{VideoVAE+: Large Motion Video Autoencoding with Cross-Modal Video VAE}},
  author    = {Xing, Yazhou and Fei, Yang and He, Yingqing and Chen, Jingye and Xie, Jiaxin and Chi, Xiaowei and Chen, Qifeng},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {17951-17960},
  url       = {https://mlanthology.org/iccv/2025/xing2025iccv-videovae/}
}