Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation
Abstract
Diffusion models have proven to be highly effective in image and video generation; however, they encounter challenges in the correct composition of objects when generating images of varying sizes due to single-scale training data. Adapting large pre-trained diffusion models to higher resolution demands substantial computational and optimization resources, yet achieving generation capabilities comparable to low-resolution models remains challenging. This paper proposes a novel self-cascade diffusion model that leverages the knowledge gained from a well-trained low-resolution image/video generation model, enabling rapid adaptation to higher-resolution generation. Building on this, we employ the pivot replacement strategy to facilitate a tuning-free version by progressively leveraging reliable semantic guidance derived from the low-resolution model. We further propose to integrate a sequence of learnable multi-scale upsampler modules for a tuning version capable of efficiently learning structural details at a new scale from a small amount of newly acquired high-resolution training data. Compared to full fine-tuning, our approach achieves a 5× training speed-up and requires only 0.002M tuning parameters. Extensive experiments demonstrate that our approach can quickly adapt to higher-resolution image and video synthesis by fine-tuning for just 10k steps, with virtually no additional inference time.
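The tuning-free cascade described above can be illustrated with a toy sketch. This is only one hedged reading of "progressively leveraging semantic guidance from the low-resolution model", not the authors' implementation: the function names, the nearest-neighbour upsampling, the noise schedule `alpha_bar`, the `pivot_step` parameter, and the `denoise_step` callable (a stand-in for one reverse step of a pre-trained model) are all assumptions made for illustration. Plain Python lists are used so the example has no dependencies.

```python
import math
import random

def upsample_nearest(latent, factor):
    """Nearest-neighbour upsampling of a 2D latent stored as a list of rows."""
    return [[v for v in row for _ in range(factor)]
            for row in latent for _ in range(factor)]

def add_noise(latent, alpha_bar_k, rng):
    """Forward-diffuse a clean latent to the noise level given by alpha_bar_k."""
    a = math.sqrt(alpha_bar_k)
    s = math.sqrt(1.0 - alpha_bar_k)
    return [[a * v + s * rng.gauss(0.0, 1.0) for v in row] for row in latent]

def self_cascade_sample(denoise_step, low_res_latent, factors,
                        alpha_bar, pivot_step, rng):
    """Hypothetical cascade: reuse the result of each scale as the starting
    point ("pivot") for the next, instead of sampling from pure noise."""
    latent = low_res_latent
    for factor in factors:
        # Upsample the previous result to the next resolution.
        pivot = upsample_nearest(latent, factor)
        # Re-noise only to an intermediate step, keeping the coarse layout.
        noisy = add_noise(pivot, alpha_bar[pivot_step], rng)
        # Denoise from the intermediate step down to step 0 at the new scale.
        for t in range(pivot_step, -1, -1):
            noisy = denoise_step(noisy, t)
        latent = noisy
    return latent
```

With a dummy `denoise_step` and `factors=[2, 2]`, an 8×8 latent comes out as 32×32 after two stages, each running `pivot_step + 1` denoising steps. Starting from an intermediate step rather than from pure noise is what lets the low-resolution result steer the composition in this sketch.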
Cite
Text
Guo et al. "Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72764-1_3
Markdown
[Guo et al. "Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/guo2024eccv-make/) doi:10.1007/978-3-031-72764-1_3
BibTeX
@inproceedings{guo2024eccv-make,
title = {{Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation}},
author = {Guo, Lanqing and He, Yingqing and Chen, Haoxin and Xia, Menghan and Cun, Xiaodong and Wang, Yufei and Huang, Siyu and Zhang, Yong and Wang, Xintao and Chen, Qifeng and Shan, Ying and Wen, Bihan},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2024},
doi = {10.1007/978-3-031-72764-1_3},
url = {https://mlanthology.org/eccv/2024/guo2024eccv-make/}
}