DTMFormer: Dynamic Token Merging for Boosting Transformer-Based Medical Image Segmentation

Abstract

Despite its great potential for capturing long-range dependencies, the transformer suffers from a rarely explored issue in medical image segmentation: attention collapse, which often makes it degenerate into a bypass module in CNN-Transformer hybrid architectures. The cause is that the high computational complexity of vision transformers demands extensive training data, whereas well-annotated medical image data is relatively limited, resulting in poor convergence. In this paper, we propose a plug-and-play transformer block with dynamic token merging, named DTMFormer, that avoids building long-range dependencies on redundant and duplicated tokens and thus converges better. Specifically, DTMFormer consists of an attention-guided token merging (ATM) module, which adaptively clusters tokens into fewer semantic tokens based on feature and dependency similarity, and a light token reconstruction module, which fuses ordinary and semantic tokens. Because self-attention in ATM is computed over fewer tokens, DTMFormer has lower complexity and converges more easily. Extensive experiments on publicly available datasets demonstrate the effectiveness of DTMFormer as a plug-and-play module that simultaneously reduces complexity and improves performance. We believe it will inspire future work on rethinking transformers in medical image segmentation. Code: https://github.com/iam-nacl/DTMFormer.
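
The merge, attend, reconstruct idea described in the abstract can be illustrated with a rough NumPy sketch. This is not the authors' implementation (see the linked repository for that). It assumes plain k-means on cosine feature similarity in place of the attention-guided clustering, a single-head attention without learned projections, and a residual scatter-back as the reconstruction step. The names `merge_tokens`, `self_attention`, and `merged_attention_block` are hypothetical.

```python
import numpy as np


def merge_tokens(tokens, num_clusters, num_iters=5):
    """Cluster N tokens into num_clusters merged tokens.

    Assumption: k-means on cosine similarity of features only. The paper's
    ATM module also uses dependency (attention) similarity, omitted here.
    """
    n, d = tokens.shape
    normed = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    # Deterministic initialisation: evenly spaced tokens act as centers.
    init_idx = np.linspace(0, n - 1, num_clusters).astype(int)
    centers = normed[init_idx].copy()
    for _ in range(num_iters):
        assign = np.argmax(normed @ centers.T, axis=1)
        for k in range(num_clusters):
            members = normed[assign == k]
            if len(members) > 0:
                c = members.mean(axis=0)
                centers[k] = c / (np.linalg.norm(c) + 1e-8)
    assign = np.argmax(normed @ centers.T, axis=1)
    # Each merged token is the mean of its cluster (zeros if a cluster is empty).
    merged = np.zeros((num_clusters, d))
    for k in range(num_clusters):
        members = tokens[assign == k]
        if len(members) > 0:
            merged[k] = members.mean(axis=0)
    return merged, assign


def self_attention(x):
    """Single-head scaled dot-product attention without learned projections."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x


def merged_attention_block(tokens, num_clusters):
    """Attend over M merged tokens, then scatter the result back to all N tokens."""
    merged, assign = merge_tokens(tokens, num_clusters)
    attended = self_attention(merged)  # M x M score matrix instead of N x N
    return tokens + attended[assign]   # assumed reconstruction: residual fusion


rng = np.random.default_rng(0)
tokens = rng.normal(size=(64, 16))  # 64 tokens with 16 channels each
out = merged_attention_block(tokens, num_clusters=8)
print(out.shape)  # (64, 16)
```

In this toy setting the attention score matrix shrinks from 64 x 64 to 8 x 8 entries, which is the source of the complexity reduction the abstract refers to; the output keeps one vector per original token, so the block could stand in for a standard attention layer.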

Cite

Text

Wang et al. "DTMFormer: Dynamic Token Merging for Boosting Transformer-Based Medical Image Segmentation." AAAI Conference on Artificial Intelligence, 2024. doi:10.1609/AAAI.V38I6.28394

Markdown

[Wang et al. "DTMFormer: Dynamic Token Merging for Boosting Transformer-Based Medical Image Segmentation." AAAI Conference on Artificial Intelligence, 2024.](https://mlanthology.org/aaai/2024/wang2024aaai-dtmformer/) doi:10.1609/AAAI.V38I6.28394

BibTeX

@inproceedings{wang2024aaai-dtmformer,
  title     = {{DTMFormer: Dynamic Token Merging for Boosting Transformer-Based Medical Image Segmentation}},
  author    = {Wang, Zhehao and Lin, Xian and Wu, Nannan and Yu, Li and Cheng, Kwang-Ting and Yan, Zengqiang},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2024},
  pages     = {5814--5822},
  doi       = {10.1609/AAAI.V38I6.28394},
  url       = {https://mlanthology.org/aaai/2024/wang2024aaai-dtmformer/}
}