Famba-V: Fast Vision Mamba with Cross-Layer Token Fusion
Abstract
Mamba and Vision Mamba (Vim) models have shown their potential as an alternative to Transformer-based methods. This work introduces Fast Mamba for Vision (Famba-V), a cross-layer token fusion technique that enhances the training efficiency of Vim models. The key idea of Famba-V is to identify and fuse similar tokens across different Vim layers based on a suite of cross-layer strategies, instead of applying token fusion uniformly across all layers as existing works propose. We evaluate the performance of Famba-V on CIFAR-100. Our results show that Famba-V enhances the training efficiency of Vim models by reducing both training time and peak memory usage during training. Moreover, the proposed cross-layer strategies allow Famba-V to deliver superior accuracy-efficiency trade-offs. Together, these results demonstrate that Famba-V is a promising efficiency enhancement technique for Vim models.
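The core mechanism described above can be sketched in a few lines: measure token similarity, average the most similar pairs to shrink the sequence, and apply this fusion only in a chosen subset of layers. The sketch below is a minimal illustration under our own assumptions, not the paper's implementation; the function names (`fuse_tokens`, `famba_v_upper`), the even/odd pairing, and the `start_layer` threshold for the "upper layers" strategy are all illustrative choices.

```python
import numpy as np

def fuse_tokens(tokens, r):
    """Merge the r most similar (even, odd) token pairs by averaging.

    tokens: (n, d) array of token embeddings; returns a shorter sequence.
    This is a toy similarity-based fusion, not the paper's exact algorithm.
    """
    src, dst = tokens[::2], tokens[1::2]
    # Cosine similarity between every source token and every destination token.
    src_n = src / np.linalg.norm(src, axis=-1, keepdims=True)
    dst_n = dst / np.linalg.norm(dst, axis=-1, keepdims=True)
    sim = src_n @ dst_n.T
    best_dst = sim.argmax(axis=1)          # most similar partner per source token
    best_sim = sim.max(axis=1)
    merge_idx = np.argsort(-best_sim)[:r]  # the r most redundant source tokens
    keep_idx = np.setdiff1d(np.arange(len(src)), merge_idx)
    dst = dst.copy()
    for i in merge_idx:                    # fuse each merged token into its partner
        j = best_dst[i]
        dst[j] = (dst[j] + src[i]) / 2
    return np.concatenate([dst, src[keep_idx]], axis=0)

def famba_v_upper(layers, x, r, start_layer):
    """One hypothetical cross-layer strategy: fuse tokens only in upper layers,
    leaving early layers untouched, rather than fusing uniformly everywhere."""
    for li, layer in enumerate(layers):
        x = layer(x)
        if li >= start_layer:
            x = fuse_tokens(x, r)
    return x
```

Each fusion call removes `r` tokens, so restricting fusion to upper layers trades a smaller efficiency gain for less information loss in early layers, which is the accuracy-efficiency trade-off the cross-layer strategies are designed to navigate.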
Cite
Text
Shen et al. "Famba-V: Fast Vision Mamba with Cross-Layer Token Fusion." European Conference on Computer Vision Workshops, 2024. doi:10.1007/978-3-031-91979-4_20
Markdown
[Shen et al. "Famba-V: Fast Vision Mamba with Cross-Layer Token Fusion." European Conference on Computer Vision Workshops, 2024.](https://mlanthology.org/eccvw/2024/shen2024eccvw-fambav/) doi:10.1007/978-3-031-91979-4_20
BibTeX
@inproceedings{shen2024eccvw-fambav,
title = {{Famba-V: Fast Vision Mamba with Cross-Layer Token Fusion}},
author = {Shen, Hui and Wan, Zhongwei and Wang, Xin and Zhang, Mi},
booktitle = {European Conference on Computer Vision Workshops},
year = {2024},
pages = {268--278},
doi = {10.1007/978-3-031-91979-4_20},
url = {https://mlanthology.org/eccvw/2024/shen2024eccvw-fambav/}
}