CTRL&SHIFT: High-Quality Geometry-Aware Object Manipulation in Visual Generation

Abstract

Object-level manipulation, i.e., relocating or reorienting objects in images or videos while preserving scene realism, is central to film post-production, AR, and creative editing. Yet existing methods struggle to jointly achieve three core goals: background preservation, geometric consistency under viewpoint shifts, and user-controllable transformations. Geometry-based approaches offer precise control but require explicit 3D reconstruction and generalize poorly; diffusion-based methods generalize better but lack fine-grained geometric control. We present **Ctrl&Shift**, an end-to-end diffusion framework that achieves geometry-consistent object manipulation without explicit 3D representations. Our key insight is to decompose manipulation into two stages (object removal, then reference-guided inpainting under explicit camera pose control) and to encode both within a unified diffusion process. To enable precise, disentangled control, we design a multi-task, multi-stage training strategy that separates background, identity, and pose signals across tasks. To improve generalization, we introduce a scalable real-world dataset construction pipeline that generates paired image and video samples with estimated relative camera poses. Extensive experiments demonstrate that **Ctrl&Shift** achieves state-of-the-art results in fidelity, viewpoint consistency, and controllability. *To our knowledge, this is the first framework to unify fine-grained geometric control and real-world generalization for object manipulation without relying on any explicit 3D modeling.*
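The abstract mentions paired samples annotated with relative camera poses but does not say how those poses are represented or estimated. As a minimal sketch of the general idea only (not the authors' pipeline), the snippet below computes a relative pose from two 4x4 world-to-camera extrinsic matrices with NumPy. The function names `make_extrinsic`, `rot_y`, and `relative_pose`, and the world-to-camera convention, are assumptions made for illustration.

```python
import numpy as np


def make_extrinsic(R, t):
    """Build a 4x4 world-to-camera matrix from a 3x3 rotation and a 3-vector translation."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T


def rot_y(degrees):
    """Rotation about the y-axis by the given angle in degrees."""
    a = np.deg2rad(degrees)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])


def relative_pose(T_src, T_tgt):
    """Transform mapping source-camera coordinates to target-camera coordinates.

    Assumes both inputs are world-to-camera extrinsics (hypothetical convention).
    """
    return T_tgt @ np.linalg.inv(T_src)


# Two hypothetical camera views of the same scene.
T_src = make_extrinsic(rot_y(0.0), np.array([0.0, 0.0, 2.0]))
T_tgt = make_extrinsic(rot_y(30.0), np.array([0.5, 0.0, 2.0]))
T_rel = relative_pose(T_src, T_tgt)

# A world point expressed in each camera frame (homogeneous coordinates).
p_world = np.array([0.3, -0.2, 1.0, 1.0])
p_src = T_src @ p_world
p_tgt = T_tgt @ p_world
```

Under these assumptions, applying `T_rel` to `p_src` reproduces `p_tgt`, which is the property that makes a relative pose usable as a conditioning signal independent of any absolute world frame.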

Cite

Text

Ruan et al. "CTRL&SHIFT: High-Quality Geometry-Aware Object Manipulation in Visual Generation." International Conference on Learning Representations, 2026.

Markdown

[Ruan et al. "CTRL&SHIFT: High-Quality Geometry-Aware Object Manipulation in Visual Generation." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/ruan2026iclr-ctrl/)

BibTeX

@inproceedings{ruan2026iclr-ctrl,
  title     = {{CTRL\&SHIFT: High-Quality Geometry-Aware Object Manipulation in Visual Generation}},
  author    = {Ruan, Penghui and Zi, Bojia and Qi, Xianbiao and Huang, Youze and Xiao, Rong and Wang, Pichao and Cao, Jiannong and Shi, Yuhui},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/ruan2026iclr-ctrl/}
}