Generative Powers of Ten

Abstract

We present a method that uses a text-to-image model to generate consistent content across multiple image scales, enabling extreme semantic zooms into a scene, e.g., ranging from a wide-angle landscape view of a forest to a macro shot of an insect sitting on one of the tree branches. We achieve this through a joint multi-scale diffusion sampling approach that encourages consistency across different scales while preserving the integrity of each individual sampling process. Since each generated scale is guided by a different text prompt, our method enables deeper levels of zoom than traditional super-resolution methods, which may struggle to create new contextual structure at vastly different scales. We compare our method qualitatively with alternative techniques in image super-resolution and outpainting, and show that our method is most effective at generating consistent multi-scale content.
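The core idea of the joint sampling can be sketched in simplified form: images at successive zoom levels share an overlapping region (the center crop of the coarser view corresponds to a downsampled version of the finer view), and a consistency step reconciles the two estimates of that shared region during sampling. The sketch below, assuming a zoom factor of 2 and simple average-pool rendering, shows only this consistency projection; the actual method integrates it with the diffusion models' noise estimates, which are not modeled here, and `downsample` and `consistency_step` are hypothetical names.

```python
import numpy as np

def downsample(img, factor):
    """Average-pool by `factor` (a simple stand-in for the paper's renderer)."""
    h, w, c = img.shape
    return img.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))

def consistency_step(stack, zoom=2):
    """One multi-scale consistency projection over a zoom stack.

    `stack[0]` is the widest view; `stack[i + 1]` zooms into the center of
    `stack[i]` by `zoom`. Each scale's center crop is reconciled with the
    rendered (downsampled) next-finer scale so overlapping content agrees.
    """
    out = [img.copy() for img in stack]
    # Walk fine-to-coarse so detail propagates outward to wider views.
    for i in range(len(out) - 2, -1, -1):
        h, w, _ = out[i].shape
        ch, cw = h // zoom, w // zoom              # size of the shared center region
        y0, x0 = (h - ch) // 2, (w - cw) // 2
        rendered = downsample(out[i + 1], zoom)    # finer scale at coarser resolution
        # Average the two estimates of the shared region (a crude stand-in for
        # the paper's blending, which operates on the diffusion estimates).
        out[i][y0:y0 + ch, x0:x0 + cw] = 0.5 * (
            out[i][y0:y0 + ch, x0:x0 + cw] + rendered
        )
    return out

# Toy zoom stack of three 64x64 RGB "images".
stack = [np.random.rand(64, 64, 3) for _ in range(3)]
consistent = consistency_step(stack)
```

In the full method this projection would run inside each denoising step, so every scale's sampling process stays valid while the stack converges to a mutually consistent zoom sequence.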

Cite

Text

Wang et al. "Generative Powers of Ten." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.00685

Markdown

[Wang et al. "Generative Powers of Ten." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/wang2024cvpr-generative/) doi:10.1109/CVPR52733.2024.00685

BibTeX

@inproceedings{wang2024cvpr-generative,
  title     = {{Generative Powers of Ten}},
  author    = {Wang, Xiaojuan and Kontkanen, Janne and Curless, Brian and Seitz, Steven M. and Kemelmacher-Shlizerman, Ira and Mildenhall, Ben and Srinivasan, Pratul and Verbin, Dor and Holynski, Aleksander},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {7173--7182},
  doi       = {10.1109/CVPR52733.2024.00685},
  url       = {https://mlanthology.org/cvpr/2024/wang2024cvpr-generative/}
}