All Are Worth Words: A ViT Backbone for Score-Based Diffusion Models
Abstract
Vision transformers (ViT) have shown promise in various vision tasks, including low-level ones, while the U-Net remains dominant in score-based diffusion models. In this paper, we perform a systematic empirical study of ViT-based architectures in diffusion models. Our results suggest that adding extra long skip connections (as in the U-Net) to ViT is crucial for diffusion models. The new ViT architecture, together with other improvements, is referred to as U-ViT. On several popular visual datasets, U-ViT achieves generation results competitive with state-of-the-art U-Nets while requiring a comparable, if not smaller, amount of parameters and computation.
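The core architectural idea, long skip connections between mirrored shallow and deep blocks, can be illustrated with a minimal sketch. Assumptions: the `block` function below is a hypothetical stand-in for a full transformer block, and the concatenate-then-project skip wiring is only a schematic of the pattern the abstract describes, not the authors' implementation.

```python
import numpy as np

def block(x, w):
    # Hypothetical stand-in for a transformer block: a residual nonlinear map.
    return x + np.tanh(x @ w)

def long_skip_net(x, depth=4, dim=8, seed=0):
    """Sketch of U-ViT-style long skip connections: outputs of the first
    half of the blocks are stored, then each is concatenated with the
    mirrored deep activation and linearly projected back to `dim`."""
    rng = np.random.default_rng(seed)
    ws = [rng.normal(scale=0.1, size=(dim, dim)) for _ in range(2 * depth)]
    projs = [rng.normal(scale=0.1, size=(2 * dim, dim)) for _ in range(depth)]

    skips = []
    for i in range(depth):          # shallow half: store activations
        x = block(x, ws[i])
        skips.append(x)
    for i in range(depth):          # deep half: fuse mirrored skips
        x = np.concatenate([x, skips.pop()], axis=-1) @ projs[i]
        x = block(x, ws[depth + i])
    return x

tokens = np.ones((2, 8))            # (num_tokens, dim)
out = long_skip_net(tokens)
print(out.shape)                    # shape is preserved: (2, 8)
```

The concatenation-plus-projection step is what distinguishes this wiring from the ordinary residual connections already inside each transformer block: shallow features reach deep blocks directly rather than through the full stack.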
Cite
Text
Bao et al. "All Are Worth Words: A ViT Backbone for Score-Based Diffusion Models." NeurIPS 2022 Workshops: SBM, 2022.

Markdown

[Bao et al. "All Are Worth Words: A ViT Backbone for Score-Based Diffusion Models." NeurIPS 2022 Workshops: SBM, 2022.](https://mlanthology.org/neuripsw/2022/bao2022neuripsw-all/)

BibTeX
@inproceedings{bao2022neuripsw-all,
title = {{All Are Worth Words: A ViT Backbone for Score-Based Diffusion Models}},
author = {Bao, Fan and Li, Chongxuan and Cao, Yue and Zhu, Jun},
booktitle = {NeurIPS 2022 Workshops: SBM},
year = {2022},
url = {https://mlanthology.org/neuripsw/2022/bao2022neuripsw-all/}
}