Frame In-N-Out: Unbounded Controllable Image-to-Video Generation

Abstract

Controllability, temporal coherence, and detail synthesis remain the most critical challenges in video generation. In this paper, we focus on a commonly used yet underexplored cinematic technique known as Frame In and Frame Out. Specifically, starting from image-to-video generation, users can control objects in the image so that they naturally leave the scene, or introduce brand-new identity references that enter the scene, guided by a user-specified motion trajectory. To support this task, we introduce a semi-automatically curated dataset, an efficient identity-preserving, motion-controllable video Diffusion Transformer architecture, and a comprehensive evaluation protocol targeting this task. Our evaluation shows that our proposed approach significantly outperforms existing baselines.
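To make the conditioning signals concrete, below is a minimal, hypothetical PyTorch sketch of how a user-specified trajectory map and an identity-reference embedding could be fused into the token stream of a video Diffusion Transformer. All module names, tensor shapes, and the fusion scheme are illustrative assumptions for exposition; the paper's actual architecture is not reproduced here.

import torch
import torch.nn as nn

class TrajectoryIdentityConditioner(nn.Module):
    """Hypothetical sketch: fuse a rasterized motion trajectory and an
    identity-reference embedding into a video DiT's token stream.
    Shapes and the fusion scheme are illustrative assumptions."""
    def __init__(self, id_dim=64, hidden_dim=256):
        super().__init__()
        # Patchify per-frame trajectory heatmaps into tokens (stride-8 patches).
        self.traj_proj = nn.Conv2d(1, hidden_dim, kernel_size=8, stride=8)
        # Project a precomputed identity embedding (e.g., from an image encoder).
        self.id_proj = nn.Linear(id_dim, hidden_dim)

    def forward(self, video_tokens, traj_maps, id_embed):
        # video_tokens: (B, N, hidden_dim) patchified video latents
        # traj_maps:    (B, T, 1, H, W) rasterized user trajectory per frame
        # id_embed:     (B, id_dim) identity-reference feature
        b = traj_maps.shape[0]
        traj_tok = self.traj_proj(traj_maps.flatten(0, 1))      # (B*T, D, h, w)
        traj_tok = traj_tok.flatten(2).transpose(1, 2)          # (B*T, h*w, D)
        traj_tok = traj_tok.reshape(b, -1, traj_tok.shape[-1])  # (B, T*h*w, D)
        id_tok = self.id_proj(id_embed).unsqueeze(1)            # (B, 1, D)
        # Additive fusion keeps trajectory guidance spatially aligned with
        # the video patches; the appended identity token lets every DiT
        # layer attend to the reference appearance.
        return torch.cat([video_tokens + traj_tok, id_tok], dim=1)

# Toy usage with random tensors.
cond = TrajectoryIdentityConditioner()
tokens = torch.randn(2, 16 * 4 * 4, 256)  # B=2, T=16 frames, 4x4 patches/frame
trajs = torch.randn(2, 16, 1, 32, 32)     # 32x32 maps -> 4x4 patches at stride 8
ident = torch.randn(2, 64)
out = cond(tokens, trajs, ident)          # (2, 16*4*4 + 1, 256)

Under these assumptions, the denoiser sees one extra identity token per sequence, which is one simple way a reference subject could be kept available for frame-in events while the trajectory tokens steer where it enters or exits.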

Cite

Text

Wang et al. "Frame In-N-Out: Unbounded Controllable Image-to-Video Generation." Advances in Neural Information Processing Systems, 2025.

Markdown

[Wang et al. "Frame In-N-Out: Unbounded Controllable Image-to-Video Generation." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/wang2025neurips-frame/)

BibTeX

@inproceedings{wang2025neurips-frame,
  title     = {{Frame In-N-Out: Unbounded Controllable Image-to-Video Generation}},
  author    = {Wang, Boyang and Chen, Xuweiyi and Gadelha, Matheus and Cheng, Zezhou},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/wang2025neurips-frame/}
}