DTMFormer: Dynamic Token Merging for Boosting Transformer-Based Medical Image Segmentation
Abstract
Despite their great potential for capturing long-range dependencies, transformers in medical image segmentation suffer from a rarely explored issue, attention collapse, which often causes them to degenerate into bypass modules in CNN-Transformer hybrid architectures. This is because the high computational complexity of vision transformers demands extensive training data, while well-annotated medical image data is relatively limited, resulting in poor convergence. In this paper, we propose a plug-and-play transformer block with dynamic token merging, named DTMFormer, which avoids building long-range dependencies on redundant and duplicated tokens and thus converges better. Specifically, DTMFormer consists of an attention-guided token merging (ATM) module, which adaptively clusters tokens into fewer semantic tokens based on feature and dependency similarity, and a light token reconstruction module, which fuses ordinary and semantic tokens. Because self-attention in ATM is computed over fewer tokens, DTMFormer has lower complexity and converges more easily. Extensive experiments on publicly available datasets demonstrate the effectiveness of DTMFormer as a plug-and-play module that simultaneously reduces complexity and improves performance. We believe it will inspire future work on rethinking transformers in medical image segmentation. Code: https://github.com/iam-nacl/DTMFormer.
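The merge, attend, reconstruct flow described in the abstract can be illustrated with a rough NumPy sketch. This is not the authors' implementation (see the linked repository for that). It makes several simplifying assumptions: plain k-means on token features stands in for the attention-guided clustering, which in the paper also uses dependency similarity; attention has no learned projections; and a residual add stands in for the light token reconstruction module. The names `merge_tokens`, `self_attention`, and `merged_attention_block` are hypothetical.

```python
import numpy as np

def merge_tokens(tokens, k, iters=10):
    """Cluster N tokens into k merged tokens (assumption: plain k-means on features)."""
    tokens = np.asarray(tokens, dtype=float)
    n = tokens.shape[0]
    # Deterministic initialisation: k evenly spaced tokens act as the first centers.
    centers = tokens[np.linspace(0, n - 1, k).astype(int)].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every token to every center, shape (N, k).
        dist = ((tokens[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = dist.argmin(axis=1)
        for c in range(k):
            members = tokens[assign == c]
            if len(members) > 0:  # keep the previous center if a cluster is empty
                centers[c] = members.mean(axis=0)
    return centers, assign

def self_attention(x):
    """Single-head scaled dot-product self-attention without learned weights."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x

def merged_attention_block(tokens, k):
    """Attend over k merged tokens, then map the result back to all N tokens."""
    tokens = np.asarray(tokens, dtype=float)
    merged, assign = merge_tokens(tokens, k)
    attended = self_attention(merged)   # attention matrix is k x k, not N x N
    return tokens + attended[assign]    # assumption: residual add as reconstruction

tokens = np.random.default_rng(0).normal(size=(64, 8))  # 64 tokens, 8 channels
out = merged_attention_block(tokens, k=4)
print(out.shape)
```

Under these assumptions the attention matrix shrinks from N x N to k x k, which is the source of the complexity reduction the abstract refers to; the clustering step itself adds a cost that this sketch does not try to optimise.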
Cite
Text
Wang et al. "DTMFormer: Dynamic Token Merging for Boosting Transformer-Based Medical Image Segmentation." AAAI Conference on Artificial Intelligence, 2024. doi:10.1609/AAAI.V38I6.28394
Markdown
[Wang et al. "DTMFormer: Dynamic Token Merging for Boosting Transformer-Based Medical Image Segmentation." AAAI Conference on Artificial Intelligence, 2024.](https://mlanthology.org/aaai/2024/wang2024aaai-dtmformer/) doi:10.1609/AAAI.V38I6.28394
BibTeX
@inproceedings{wang2024aaai-dtmformer,
title = {{DTMFormer: Dynamic Token Merging for Boosting Transformer-Based Medical Image Segmentation}},
author = {Wang, Zhehao and Lin, Xian and Wu, Nannan and Yu, Li and Cheng, Kwang-Ting and Yan, Zengqiang},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2024},
pages = {5814-5822},
doi = {10.1609/AAAI.V38I6.28394},
url = {https://mlanthology.org/aaai/2024/wang2024aaai-dtmformer/}
}