Move Anything with Layered Scene Diffusion

Abstract

Diffusion models generate images with an unprecedented level of quality, but how can we freely rearrange image layouts? Recent works generate controllable scenes by learning spatially disentangled latent codes, but these methods do not apply to diffusion models due to their fixed forward process. In this work, we propose SceneDiffusion, which optimizes a layered scene representation during the diffusion sampling process. Our key insight is that spatial disentanglement can be obtained by jointly denoising scene renderings at different spatial layouts. Our generated scenes support a wide range of spatial editing operations, including moving, resizing, and cloning, as well as layer-wise appearance editing operations, including object restyling and replacing. Moreover, a scene can be generated conditioned on a reference image, thus enabling object moving for in-the-wild images. Notably, this approach is training-free, compatible with general text-to-image diffusion models, and responsive in less than a second.
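The core idea stated in the abstract can be illustrated with a minimal toy sketch: keep a shared set of layer latents (a background plus a masked object layer), render the same layers at several randomly sampled layouts, denoise each rendering one step, and pool the per-layout updates back into the shared layers so every layer stays consistent across layouts. This is not the paper's implementation: the layer structure, the offset sampling, and the `toy_denoise_step` stand-in (a real pipeline would call a text-to-image diffusion model's denoiser) are all simplifying assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 16  # scene size; object layer is 8x8 (toy choices)

# Layered scene: a background latent plus one object layer with a mask.
bg = rng.standard_normal((H, W))
obj = rng.standard_normal((8, 8))
mask = np.ones((8, 8), dtype=bool)
obj0 = np.abs(obj).mean()  # recorded only to check that denoising progresses

def render(bg, obj, mask, dy, dx):
    """Composite the object layer onto the background at offset (dy, dx)."""
    img = bg.copy()
    img[dy:dy + 8, dx:dx + 8][mask] = obj[mask]
    return img

def toy_denoise_step(x, step=0.1):
    """Stand-in for one diffusion denoising step (just shrinks toward zero)."""
    return x * (1.0 - step)

# Jointly denoise renderings of the SAME layers at several sampled layouts,
# then pool the per-layout estimates back into the shared layer latents.
for t in range(10):
    offsets = [(rng.integers(0, 9), rng.integers(0, 9)) for _ in range(4)]
    bg_acc = np.zeros_like(bg); bg_cnt = np.zeros_like(bg)
    obj_acc = np.zeros_like(obj); obj_cnt = np.zeros_like(obj)
    for dy, dx in offsets:
        x = toy_denoise_step(render(bg, obj, mask, dy, dx))
        # Route each pixel's update back to the layer that produced it:
        # occluded pixels update the object layer, the rest update the background.
        occ = np.zeros((H, W), dtype=bool)
        occ[dy:dy + 8, dx:dx + 8] = mask
        bg_acc[~occ] += x[~occ]; bg_cnt[~occ] += 1
        obj_acc[mask] += x[dy:dy + 8, dx:dx + 8][mask]; obj_cnt[mask] += 1
    bg = np.where(bg_cnt > 0, bg_acc / np.maximum(bg_cnt, 1), bg)
    obj = np.where(obj_cnt > 0, obj_acc / np.maximum(obj_cnt, 1), obj)
```

Because every layout shares the same layer latents, the denoised result stays coherent no matter where the object is placed, which is what makes post-hoc moving, resizing, or cloning possible without retraining.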

Cite

Text

Ren et al. "Move Anything with Layered Scene Diffusion." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.00610

Markdown

[Ren et al. "Move Anything with Layered Scene Diffusion." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/ren2024cvpr-move/) doi:10.1109/CVPR52733.2024.00610

BibTeX

@inproceedings{ren2024cvpr-move,
  title     = {{Move Anything with Layered Scene Diffusion}},
  author    = {Ren, Jiawei and Xu, Mengmeng and Wu, Jui-Chieh and Liu, Ziwei and Xiang, Tao and Toisoul, Antoine},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {6380--6389},
  doi       = {10.1109/CVPR52733.2024.00610},
  url       = {https://mlanthology.org/cvpr/2024/ren2024cvpr-move/}
}