EgoExo-Gen: Ego-Centric Video Prediction by Watching Exo-Centric Videos

Abstract

Generating videos from the first-person perspective holds broad promise for augmented reality and embodied intelligence. In this work, we explore the cross-view video prediction task: given an exo-centric video, the first frame of the corresponding ego-centric video, and a textual instruction, the goal is to generate the future frames of the ego-centric video. Inspired by the notion that hand-object interactions (HOI) in ego-centric videos represent the primary intentions and actions of the current actor, we present EgoExo-Gen, which explicitly models hand-object dynamics for cross-view video prediction. EgoExo-Gen consists of two stages. First, we design a cross-view HOI mask prediction model that anticipates the HOI masks in future ego-frames by modeling the spatio-temporal ego-exo correspondence. Next, we employ a video diffusion model to predict future ego-frames from the first ego-frame and the textual instruction, while incorporating the HOI masks as structural guidance to enhance prediction quality. To facilitate training, we develop a fully automated pipeline that generates pseudo HOI masks for both ego- and exo-videos by exploiting vision foundation models. Extensive experiments demonstrate that EgoExo-Gen outperforms previous video prediction models on the public Ego-Exo4D and H2O benchmarks, with the HOI masks significantly improving the generation of hands and interactive objects in the ego-centric videos.
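The abstract does not specify how the predicted HOI masks are injected into the diffusion model. One common way to supply per-frame masks as structural guidance to a video denoiser is channel-wise concatenation with the noisy latents; the NumPy sketch below illustrates only that generic pattern (all function names, shapes, and channel counts are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def concat_mask_guidance(latents, hoi_masks):
    """Attach HOI masks to video latents along the channel axis.

    latents:   (T, C, H, W) noisy video latents for T frames
    hoi_masks: (T, 1, H, W) binary hand-object masks per frame
    returns:   (T, C+1, H, W) conditioned input for a denoising network
    """
    assert latents.shape[0] == hoi_masks.shape[0], "frame counts must match"
    assert latents.shape[2:] == hoi_masks.shape[2:], "spatial sizes must match"
    return np.concatenate([latents, hoi_masks], axis=1)

# Toy shapes: 8 frames, 4 latent channels, 32x32 spatial resolution.
latents = np.random.randn(8, 4, 32, 32)
masks = (np.random.rand(8, 1, 32, 32) > 0.5).astype(np.float32)
cond = concat_mask_guidance(latents, masks)
print(cond.shape)  # (8, 5, 32, 32)
```

In this pattern, the denoiser's first convolution simply takes one extra input channel per mask, so the spatial structure of hands and objects is visible at every denoising step.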

Cite

Text

Xu et al. "EgoExo-Gen: Ego-Centric Video Prediction by Watching Exo-Centric Videos." International Conference on Learning Representations, 2025.

Markdown

[Xu et al. "EgoExo-Gen: Ego-Centric Video Prediction by Watching Exo-Centric Videos." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/xu2025iclr-egoexogen/)

BibTeX

@inproceedings{xu2025iclr-egoexogen,
  title     = {{EgoExo-Gen: Ego-Centric Video Prediction by Watching Exo-Centric Videos}},
  author    = {Xu, Jilan and Huang, Yifei and Pei, Baoqi and Hou, Junlin and Li, Qingqiu and Chen, Guo and Zhang, Yuejie and Feng, Rui and Xie, Weidi},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/xu2025iclr-egoexogen/}
}