Mean-Shift Feature Transformer

Abstract

Transformer models developed in NLP have made a great impact on computer vision, producing promising performance on various tasks. While multi-head attention, a characteristic mechanism of the transformer, attracts keen research interest, such as for reducing computation cost, we analyze the transformer model from the viewpoint of feature transformation based on the distribution of input feature tokens. The analysis inspires us to derive a novel transformation method from the mean-shift update, which is an effective gradient ascent toward a local mode of distinctive representation on the token distribution. We also present an efficient projection approach to reduce the parameter size of the linear projections constituting the proposed multi-head feature transformation. In experiments on the ImageNet-1K dataset, the proposed methods, embedded into various network models in place of the transformer module, exhibit favorable performance improvement.
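To make the mean-shift view of attention concrete, the following is a minimal NumPy sketch of a mean-shift-style token transformation, not the paper's exact formulation: the softmax kernel, the single-head query/key/value projections, and the step size are illustrative assumptions. Each token is moved toward the kernel-weighted mean of the (projected) tokens, i.e., a gradient-ascent step toward a local mode of the token distribution.

```python
# Minimal sketch of a mean-shift-style feature transformation (assumptions:
# softmax kernel, single head, and a step-size parameter; not the paper's code).
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mean_shift_transform(X, Wq, Wk, Wv, step=1.0):
    """One mean-shift-style update over a set of tokens.

    X          : (n, d) input feature tokens
    Wq, Wk, Wv : (d, d) linear projections (hypothetical single head)
    step       : step size of the mean-shift (gradient-ascent) move
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Softmax over scaled similarities acts as normalized kernel weights.
    W = softmax(Q @ K.T / np.sqrt(X.shape[1]), axis=-1)  # (n, n)
    weighted_mean = W @ V                                 # kernel-weighted mean of tokens
    # Mean-shift vector: move each token toward a local mode of the distribution.
    return X + step * (weighted_mean - X)

rng = np.random.default_rng(0)
n, d = 8, 16
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
Y = mean_shift_transform(X, Wq, Wk, Wv)
print(Y.shape)  # (8, 16)
```

With step=1.0 the update reduces to replacing each token by its weighted mean, which resembles a standard attention output; smaller steps interpolate between the input token and that local mode.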

Cite

Text

Kobayashi. "Mean-Shift Feature Transformer." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.00578

Markdown

[Kobayashi. "Mean-Shift Feature Transformer." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/kobayashi2024cvpr-meanshift/) doi:10.1109/CVPR52733.2024.00578

BibTeX

@inproceedings{kobayashi2024cvpr-meanshift,
  title     = {{Mean-Shift Feature Transformer}},
  author    = {Kobayashi, Takumi},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {6047--6056},
  doi       = {10.1109/CVPR52733.2024.00578},
  url       = {https://mlanthology.org/cvpr/2024/kobayashi2024cvpr-meanshift/}
}