Instruction-Based Image Manipulation by Watching How Things Move

Abstract

This paper introduces a novel dataset construction pipeline that samples pairs of frames from videos and uses multimodal large language models (MLLMs) to generate editing instructions for training instruction-based image manipulation models. Video frames inherently preserve the identity of subjects and scenes, ensuring consistent content preservation during editing. Additionally, video data captures diverse, natural dynamics (such as non-rigid subject motion and complex camera movements) that are difficult to model otherwise, making it an ideal source for scalable dataset construction. Using this approach, we create a new dataset to train InstructMove, a model capable of instruction-based complex manipulations that are difficult to achieve with synthetically generated datasets. Our model demonstrates state-of-the-art performance in tasks such as adjusting subject poses, rearranging elements, and altering camera perspectives.
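
The sketch below illustrates the kind of pipeline the abstract describes: sample two frames from a video a short time apart, then ask an MLLM to describe the change between them as an editing instruction. This is not the authors' implementation; the function names, the sampling-gap parameters, and the use of OpenCV are assumptions for illustration, and the MLLM call is a hypothetical stub to be replaced with an actual model or API.

# Minimal sketch of a video-to-instruction data pipeline (assumptions noted above).
import random
import cv2


def sample_frame_pair(video_path, min_gap=12, max_gap=48):
    """Sample a (source, target) frame pair a short temporal gap apart.

    Keeping the gap small means both frames usually show the same subject
    and scene, so identity is preserved while the motion between them
    supplies the change an instruction can describe.
    """
    cap = cv2.VideoCapture(video_path)
    n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    gap = random.randint(min_gap, max_gap)
    start = random.randint(0, max(0, n_frames - gap - 1))

    cap.set(cv2.CAP_PROP_POS_FRAMES, start)
    ok_a, frame_a = cap.read()
    cap.set(cv2.CAP_PROP_POS_FRAMES, start + gap)
    ok_b, frame_b = cap.read()
    cap.release()
    if not (ok_a and ok_b):
        raise RuntimeError(f"Could not read a frame pair from {video_path}")
    return frame_a, frame_b


def generate_edit_instruction(src_frame, tgt_frame):
    """Hypothetical MLLM call: given two frames, return an instruction such
    as "Turn the dog's head toward the camera". Replace with a real
    multimodal model or API of your choice."""
    raise NotImplementedError("Plug in an MLLM prompt/API here.")


def build_example(video_path):
    # One training triplet: (source image, instruction) -> target image.
    src, tgt = sample_frame_pair(video_path)
    instruction = generate_edit_instruction(src, tgt)
    return {"source": src, "target": tgt, "instruction": instruction}

In a sketch like this, the sampling gap acts as a knob on edit magnitude: a small gap yields subtle pose or viewpoint changes, while a larger gap (within the same shot) yields the more complex non-rigid motion and camera changes the abstract mentions.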

Cite

Text

Cao et al. "Instruction-Based Image Manipulation by Watching How Things Move." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.00258

Markdown

[Cao et al. "Instruction-Based Image Manipulation by Watching How Things Move." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/cao2025cvpr-instructionbased/) doi:10.1109/CVPR52734.2025.00258

BibTeX

@inproceedings{cao2025cvpr-instructionbased,
  title     = {{Instruction-Based Image Manipulation by Watching How Things Move}},
  author    = {Cao, Mingdeng and Zhang, Xuaner and Zheng, Yinqiang and Xia, Zhihao},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {2704--2713},
  doi       = {10.1109/CVPR52734.2025.00258},
  url       = {https://mlanthology.org/cvpr/2025/cao2025cvpr-instructionbased/}
}