Mean-Shift Feature Transformer
Abstract
Transformer models developed in NLP have made a great impact on computer vision, producing promising performance on various tasks. While multi-head attention, a characteristic mechanism of the transformer, attracts keen research interest, such as for reducing computation cost, we analyze the transformer model from the viewpoint of feature transformation based on a distribution of input feature tokens. The analysis inspires us to derive a novel transformation method from the mean-shift update, which is an effective gradient ascent to seek a local mode of distinctive representation on the token distribution. We also present an efficient projection approach to reduce the parameter size of the linear projections constituting the proposed multi-head feature transformation. In experiments on the ImageNet-1K dataset, the proposed methods, embedded into various network models in place of the transformer module, exhibit favorable performance improvement.
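The abstract describes the core idea only at a high level. As a rough, hypothetical sketch of a single mean-shift step over feature tokens, assuming a softmax kernel on dot-product similarities (the function name, the temperature parameter, and the kernel choice are illustrative assumptions, not the paper's exact multi-head formulation):

import torch
import torch.nn.functional as F

def mean_shift_step(tokens: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    # tokens: (N, D) feature tokens. One mean-shift step moves each token toward
    # a local mode of the token distribution; the softmax kernel here is an
    # illustrative assumption, not the paper's exact transformation.
    sim = tokens @ tokens.t() / temperature   # pairwise dot-product similarities
    weights = F.softmax(sim, dim=-1)          # normalized kernel weights per token
    weighted_mean = weights @ tokens          # kernel-weighted mean of all tokens
    return weighted_mean - tokens             # mean-shift vector toward the local mode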
Cite
Text
Kobayashi. "Mean-Shift Feature Transformer." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.00578Markdown
[Kobayashi. "Mean-Shift Feature Transformer." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/kobayashi2024cvpr-meanshift/) doi:10.1109/CVPR52733.2024.00578BibTeX
@inproceedings{kobayashi2024cvpr-meanshift,
title = {{Mean-Shift Feature Transformer}},
author = {Kobayashi, Takumi},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2024},
pages = {6047-6056},
doi = {10.1109/CVPR52733.2024.00578},
url = {https://mlanthology.org/cvpr/2024/kobayashi2024cvpr-meanshift/}
}