Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models and Time-Dependent Layer Normalization

Liu, Qihao; Zeng, Zhanpeng; He, Ju; Yu, Qihang; Shen, Xiaohui; Chen, Liang-Chieh

doi:10.52202/079017-4254

NeurIPS 2024

Abstract

This paper presents enhancements to diffusion models by integrating a novel multi-resolution network and time-dependent layer normalization. Diffusion models have gained prominence for their effectiveness in high-fidelity image generation. While conventional approaches rely on convolutional U-Net architectures, recent Transformer-based designs have demonstrated superior performance and scalability. However, Transformer architectures, which tokenize input data (via "patchification"), face a trade-off between visual fidelity and computational complexity, because the cost of self-attention grows quadratically with the number of tokens. Larger patch sizes make attention computation more efficient, but they struggle to capture fine-grained visual details, leading to image distortions. To address this challenge, we propose augmenting the **Di**ffusion model with the **M**ulti-**R**esolution network (DiMR), a framework that refines features across multiple resolutions, progressively enhancing detail from low to high resolution. Additionally, we introduce Time-Dependent Layer Normalization (TD-LN), a parameter-efficient approach that incorporates time-dependent parameters into layer normalization to inject time information and achieve superior performance. We demonstrate the method's efficacy on the class-conditional ImageNet generation benchmark, where DiMR-XL variants surpass previous diffusion models, achieving FID scores of 1.70 on ImageNet $256 \times 256$ and 2.89 on ImageNet $512 \times 512$. Our best variant, DiMR-G, further establishes a state-of-the-art FID of 1.63 on ImageNet $256 \times 256$.
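
The patch-size trade-off mentioned in the abstract can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: the helper names `num_tokens` and `attention_pairs` and the 64-pixel input side are assumptions chosen for the example. It counts the tokens produced by patchifying a square input and the number of query-key pairs that full self-attention would then compare.

```python
def num_tokens(side: int, patch: int) -> int:
    """Tokens produced when a side x side input is cut into patch x patch patches."""
    if side % patch != 0:
        raise ValueError("side must be divisible by patch")
    return (side // patch) ** 2


def attention_pairs(tokens: int) -> int:
    """Query-key pairs compared by full self-attention (quadratic in token count)."""
    return tokens * tokens


# Hypothetical input side, chosen only for illustration.
side = 64
for patch in (2, 4, 8):
    n = num_tokens(side, patch)
    print(f"patch={patch}: tokens={n}, attention pairs={attention_pairs(n)}")
```

Under these assumptions, doubling the patch size divides the token count by 4 and the number of attention pairs by 16, while each token has to summarize four times as many input positions. That is the efficiency-versus-detail tension the abstract describes.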

Cite

Text

Liu et al. "Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models and Time-Dependent Layer Normalization." Neural Information Processing Systems, 2024. doi:10.52202/079017-4254

Markdown

[Liu et al. "Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models and Time-Dependent Layer Normalization." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/liu2024neurips-alleviating/) doi:10.52202/079017-4254

BibTeX

@inproceedings{liu2024neurips-alleviating,
  title     = {{Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models and Time-Dependent Layer Normalization}},
  author    = {Liu, Qihao and Zeng, Zhanpeng and He, Ju and Yu, Qihang and Shen, Xiaohui and Chen, Liang-Chieh},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-4254},
  url       = {https://mlanthology.org/neurips/2024/liu2024neurips-alleviating/}
}