ZigMa: A DiT-Style Zigzag Mamba Diffusion Model
Abstract
Diffusion models have long been plagued by scalability and quadratic-complexity issues, especially within transformer-based architectures. In this study, we leverage the long-sequence modeling capability of a state-space model, Mamba, to extend its applicability to visual data generation. First, we identify a critical oversight in most current Mamba-based vision methods: the lack of consideration for spatial continuity in Mamba's scan scheme. Second, building on this insight, we introduce Zigzag Mamba, a simple, plug-and-play, minimal-parameter-burden, DiT-style solution that outperforms Mamba-based baselines and demonstrates improved speed and memory utilization compared to transformer-based baselines; moreover, its heterogeneous layerwise scan incurs no extra memory or speed cost as more scan paths are considered. Finally, we integrate Zigzag Mamba with the Stochastic Interpolant framework to investigate the scalability of the model on large-resolution visual datasets such as FacesHQ 1024×1024 and UCF101, as well as the multimodal datasets MultiModal-CelebA-HQ and MS COCO 256×256.
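To make the spatial-continuity point concrete, below is a minimal sketch of a zigzag (boustrophedon) scan ordering for an H×W patch grid. This is an illustration of the general idea only, not the paper's implementation; the function name `zigzag_path` and the NumPy-based setup are assumptions for this example. Unlike a plain raster scan, consecutive indices in this ordering are always spatially adjacent, so the 1D token sequence fed to a Mamba block never jumps across the image border.

```python
import numpy as np

def zigzag_path(H, W):
    """Return a zigzag (boustrophedon) ordering of the H*W patch grid.

    Even rows are traversed left-to-right, odd rows right-to-left, so
    consecutive positions in the 1D scan are always spatially adjacent.
    (Illustrative sketch; not the authors' released code.)
    """
    idx = np.arange(H * W).reshape(H, W)
    idx[1::2] = idx[1::2, ::-1]  # reverse every other row
    return idx.ravel()

# Example: permute patch tokens into scan order and invert afterwards.
H, W, C = 4, 4, 8
tokens = np.random.randn(H * W, C)   # flattened patch tokens (hypothetical)
order = zigzag_path(H, W)            # spatially continuous scan path
scanned = tokens[order]              # reorder before the Mamba block
restored = np.empty_like(scanned)
restored[order] = scanned            # undo the permutation after the block
```

Since each scan path is just a fixed permutation of token indices, assigning a different path to each layer (e.g., row-wise, column-wise, and their reversals), as the heterogeneous layerwise scan does, adds no parameters and no memory or speed overhead.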
Cite
Text
Hu et al. "ZigMa: A DiT-Style Zigzag Mamba Diffusion Model." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72664-4_9
Markdown
[Hu et al. "ZigMa: A DiT-Style Zigzag Mamba Diffusion Model." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/hu2024eccv-zigma/) doi:10.1007/978-3-031-72664-4_9
BibTeX
@inproceedings{hu2024eccv-zigma,
title = {{ZigMa: A DiT-Style Zigzag Mamba Diffusion Model}},
author = {Hu, Vincent Tao and Baumann, Stefan A. and Gui, Ming and Grebenkova, Olga and Ma, Pingchuan and Fischer, Johannes S. and Ommer, Bj{\"o}rn},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2024},
doi = {10.1007/978-3-031-72664-4_9},
url = {https://mlanthology.org/eccv/2024/hu2024eccv-zigma/}
}