EgoTwin: Dreaming Body and View in First Person

Abstract

While exocentric video synthesis has made great progress, egocentric video generation remains largely underexplored; it requires modeling first-person view content together with the camera motion patterns induced by the wearer's body movements. To bridge this gap, we introduce a novel task of joint egocentric video and human motion generation, characterized by two key challenges: 1) Viewpoint Alignment: the camera trajectory in the generated video must accurately align with the head trajectory derived from human motion; 2) Causal Interplay: the synthesized human motion must causally align with the observed visual dynamics across adjacent video frames. To address these challenges, we propose EgoTwin, a joint video-motion generation framework built on the diffusion transformer architecture. Specifically, EgoTwin introduces a head-centric motion representation that anchors the human motion to the head joint, and it incorporates a cybernetics-inspired interaction mechanism that explicitly captures the causal interplay between video and motion within attention operations. For comprehensive evaluation, we curate a large-scale real-world dataset of synchronized text-video-motion triplets and design novel metrics to assess video-motion consistency. Extensive experiments demonstrate the effectiveness of the EgoTwin framework. Qualitative results are available on our project page: https://egotwin.pages.dev/.
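
The abstract does not specify the exact form of the head-centric motion representation, so the following is only a minimal sketch of the general idea of anchoring motion to the head joint. It assumes world-space joint positions of shape `(T, J, 3)` and a per-frame head-to-world rotation of shape `(T, 3, 3)`; the function names `to_head_centric` and `from_head_centric`, the `head_idx` parameter, and the array layout are hypothetical and are not taken from the paper.

```python
import numpy as np


def to_head_centric(joints, head_rot, head_idx=0):
    """Express joints in the head's local frame (illustrative assumption only).

    joints:   (T, J, 3) world-space joint positions.
    head_rot: (T, 3, 3) head-to-world rotation matrices.
    head_idx: index of the head joint in the J dimension (hypothetical layout).
    """
    head_pos = joints[:, head_idx:head_idx + 1, :]      # (T, 1, 3)
    offsets = joints - head_pos                         # translate head to origin
    # Apply the world-to-head rotation R^T to each offset.
    # For row vectors this is v_local = v_world @ R.
    return np.einsum("tjk,tkl->tjl", offsets, head_rot)


def from_head_centric(local, head_rot, head_pos):
    """Invert to_head_centric given the head trajectory head_pos of shape (T, 3)."""
    # Rotate back to world axes (v_world = R @ v_local), then restore translation.
    world_offsets = np.einsum("tjl,tkl->tjk", local, head_rot)
    return world_offsets + head_pos[:, None, :]
```

In such a representation the head joint sits at the origin in every frame, so the head trajectory (position and orientation over time) can be compared directly with the camera trajectory of the generated video, which is the quantity the Viewpoint Alignment challenge is concerned with.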

Cite

Text

Xiu et al. "EgoTwin: Dreaming Body and View in First Person." International Conference on Learning Representations, 2026.

Markdown

[Xiu et al. "EgoTwin: Dreaming Body and View in First Person." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/xiu2026iclr-egotwin/)

BibTeX

@inproceedings{xiu2026iclr-egotwin,
  title     = {{EgoTwin: Dreaming Body and View in First Person}},
  author    = {Xiu, Jingqiao and Hong, Fangzhou and Li, Yicong and Li, Mengze and Wang, Wentao and Han, Sirui and Pan, Liang and Liu, Ziwei},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/xiu2026iclr-egotwin/}
}