Efficient Transformer Adaptation with Soft Token Merging
Abstract
We develop an approach to efficiently adapt transformer layers, driven by the goals of optimization stability and broad applicability. Unlike existing methods, which rely on either simple heuristics or inefficient discrete optimization for token sampling, we design a lightweight soft token merging scheme that preserves end-to-end differentiability while retaining strong task performance. To compensate for potential information loss, we introduce a novel token inflation module that maximizes functionality preservation across transformer blocks. Experiments on vision-only, language-only, and vision-language tasks show that our method achieves comparable accuracy while substantially reducing computation cost for both training and inference, and that these gains translate into real wall-clock speedups.
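The core idea of soft token merging is that each output token is a learned convex combination of the input tokens, rather than a hard-selected subset, so gradients flow through the merge. The sketch below is a minimal, framework-free illustration of that principle under our own assumptions (a hypothetical learnable logit matrix `scores`); it is not the authors' implementation.

```python
import math

def softmax(row):
    # Numerically stable softmax over a list of logits.
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    z = sum(exps)
    return [e / z for e in exps]

def soft_merge(tokens, scores):
    """Merge n input tokens (each a d-dim list) into m output tokens.

    `scores` is an m x n matrix of logits (assumed learnable in a real
    model). Each output token is a softmax-weighted average of ALL input
    tokens, so the reduction is differentiable, unlike hard token
    sampling or pruning.
    """
    n, d = len(tokens), len(tokens[0])
    merged = []
    for row in scores:
        w = softmax(row)  # convex weights over the n input tokens
        merged.append([sum(w[i] * tokens[i][k] for i in range(n))
                       for k in range(d)])
    return merged

# Toy example: merge 4 tokens of dim 2 down to 2 tokens.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
scores = [[5.0, 0.0, 0.0, 0.0],   # output 0 attends mostly to token 0
          [0.0, 0.0, 0.0, 5.0]]   # output 1 attends mostly to token 3
out = soft_merge(tokens, scores)
```

Because the sequence length drops from n to m before the attention blocks that follow, the quadratic attention cost shrinks accordingly, which is where the training- and inference-time savings come from.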
Cite
Text
Yuan et al. "Efficient Transformer Adaptation with Soft Token Merging." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024. doi:10.1109/CVPRW63382.2024.00369
Markdown
[Yuan et al. "Efficient Transformer Adaptation with Soft Token Merging." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024.](https://mlanthology.org/cvprw/2024/yuan2024cvprw-efficient/) doi:10.1109/CVPRW63382.2024.00369
BibTeX
@inproceedings{yuan2024cvprw-efficient,
title = {{Efficient Transformer Adaptation with Soft Token Merging}},
author = {Yuan, Xin and Fei, Hongliang and Baek, Jinoo},
booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
year = {2024},
pages = {3658--3668},
doi = {10.1109/CVPRW63382.2024.00369},
url = {https://mlanthology.org/cvprw/2024/yuan2024cvprw-efficient/}
}