ZigMa: A DiT-Style Zigzag Mamba Diffusion Model

Abstract

Diffusion models have long been plagued by scalability and quadratic complexity issues, especially within transformer-based architectures. In this study, we aim to leverage the long-sequence modeling capability of a state-space model called Mamba to extend its applicability to visual data generation. Firstly, we identify a critical oversight in most current Mamba-based vision methods, namely the lack of consideration for spatial continuity in Mamba's scan scheme. Secondly, building upon this insight, we introduce a simple, plug-and-play, zero-parameter method named Zigzag Mamba, which outperforms Mamba-based baselines and demonstrates improved speed and memory utilization compared to transformer-based baselines. Lastly, we integrate Zigzag Mamba with the Stochastic Interpolant framework to investigate the scalability of the model on large-resolution visual datasets, such as FacesHQ 1024×1024, UCF101, MultiModal-CelebA-HQ, and MS COCO 256×256.
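To illustrate the spatial-continuity idea behind the zigzag scan, the sketch below (not taken from the paper's code; the function name and grid layout are illustrative assumptions) computes a boustrophedon ordering of an h×w patch grid: rows are traversed in alternating directions, so every pair of consecutive tokens in the flattened sequence remains spatially adjacent in the image, unlike a plain raster scan that jumps from the end of one row to the start of the next.

```python
def zigzag_order(h, w):
    """Return a scan order over an h x w patch grid that reverses
    direction on alternate rows, keeping consecutive tokens spatially
    adjacent (a simple zigzag / boustrophedon scan)."""
    order = []
    for r in range(h):
        # Even rows left-to-right, odd rows right-to-left.
        cols = range(w) if r % 2 == 0 else range(w - 1, -1, -1)
        order.extend(r * w + c for c in cols)
    return order

# Example: a 3x3 grid yields 0,1,2 then 5,4,3 then 6,7,8.
print(zigzag_order(3, 3))
```

Applying this permutation to the token sequence before a Mamba block (and its inverse afterwards) costs no extra parameters, which matches the "plug-and-play, zero-parameter" characterization above; the paper explores multiple such continuous scan paths rather than this single one.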

Cite

Text

Hu et al. "ZigMa: A DiT-Style Zigzag Mamba Diffusion Model." ICML 2024 Workshops: LCFM, 2024.

Markdown

[Hu et al. "ZigMa: A DiT-Style Zigzag Mamba Diffusion Model." ICML 2024 Workshops: LCFM, 2024.](https://mlanthology.org/icmlw/2024/hu2024icmlw-zigma/)

BibTeX

@inproceedings{hu2024icmlw-zigma,
  title     = {{ZigMa: A DiT-Style Zigzag Mamba Diffusion Model}},
  author    = {Hu, Vincent Tao and Baumann, Stefan Andreas and Gui, Ming and Grebenkova, Olga and Ma, Pingchuan and Schusterbauer, Johannes and Ommer, Björn},
  booktitle = {ICML 2024 Workshops: LCFM},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/hu2024icmlw-zigma/}
}