DiC: Rethinking Conv3x3 Designs in Diffusion Models
Abstract
Diffusion models have shown exceptional performance in visual generation tasks. Recently, these models have shifted from traditional U-shaped CNN-attention hybrid structures to fully transformer-based isotropic architectures. While these transformers exhibit strong scalability and performance, their reliance on the complicated self-attention operation results in slow inference speeds. In contrast to these works, we rethink one of the simplest yet fastest modules in deep learning, the 3x3 convolution, to construct a scaled-up purely convolutional diffusion model. We first discover that an Encoder-Decoder Hourglass design outperforms scalable isotropic architectures for Conv3x3, but it still falls short of our expectations. To further improve the architecture, we introduce sparse skip connections to reduce redundancy and improve scalability. On top of this architecture, we introduce conditioning improvements including stage-specific embeddings, mid-block condition injection, and conditional gating. These improvements lead to our proposed Diffusion CNN (DiC), which serves as a swift yet competitive diffusion architecture baseline. Experiments across various scales and settings show that DiC surpasses existing diffusion transformers by considerable margins in terms of performance while maintaining a clear speed advantage. Project page: https://github.com/YuchuanTian/DiC
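The abstract describes conditioning a purely convolutional block via conditional gating, but the paper's exact block design is not reproduced here. Below is a minimal PyTorch sketch of one plausible interpretation: a residual 3x3 convolution block whose output is gated by a timestep/class condition embedding. All names (Conv3x3Block, cond_dim, gate) are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Conv3x3Block(nn.Module):
    """Hypothetical DiC-style block: a residual 3x3 conv whose output is
    modulated by a learned gate computed from a condition embedding."""

    def __init__(self, channels: int, cond_dim: int):
        super().__init__()
        self.norm = nn.GroupNorm(32, channels)
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # Conditional gating: project the condition embedding to per-channel gates.
        self.gate = nn.Linear(cond_dim, channels)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        h = self.conv(F.gelu(self.norm(x)))
        g = torch.sigmoid(self.gate(cond))[:, :, None, None]  # (B, C, 1, 1)
        return x + g * h  # gated residual update

# Usage: a 64-channel feature map conditioned on a 256-dim embedding.
block = Conv3x3Block(channels=64, cond_dim=256)
x = torch.randn(2, 64, 32, 32)
cond = torch.randn(2, 256)
print(block(x, cond).shape)  # torch.Size([2, 64, 32, 32])
```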
Cite
Text

Tian et al. "DiC: Rethinking Conv3x3 Designs in Diffusion Models." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.00236

Markdown

[Tian et al. "DiC: Rethinking Conv3x3 Designs in Diffusion Models." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/tian2025cvpr-dic/) doi:10.1109/CVPR52734.2025.00236

BibTeX
@inproceedings{tian2025cvpr-dic,
title = {{DiC: Rethinking Conv3x3 Designs in Diffusion Models}},
author = {Tian, Yuchuan and Han, Jing and Wang, Chengcheng and Liang, Yuchen and Xu, Chao and Chen, Hanting},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2025},
pages = {2469-2478},
doi = {10.1109/CVPR52734.2025.00236},
url = {https://mlanthology.org/cvpr/2025/tian2025cvpr-dic/}
}