MultiMax: Sparse and Multi-Modal Attention Learning
Abstract
SoftMax is a ubiquitous ingredient of modern machine learning algorithms. It maps an input vector onto a probability simplex and reweights the input by concentrating the probability mass at large entries. Yet, because SoftMax is a smooth approximation to the Argmax function, it distributes a significant amount of probability mass to other, residual entries, leading to poor interpretability and noise. Although sparsity can be achieved by a family of SoftMax variants, they often require an alternative loss function and do not preserve multi-modality. We show that this trade-off between multi-modality and sparsity limits the expressivity of SoftMax as well as its variants. We provide a solution to this tension between objectives by proposing a piece-wise differentiable function, termed MultiMax, which adaptively modulates the output distribution according to input entry range. Through comprehensive analysis and evaluation, we show that MultiMax successfully produces a distribution that suppresses irrelevant entries while preserving multi-modality, with benefits in image classification, language modeling and machine translation.
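To make the residual-mass issue concrete, the following is a minimal sketch in plain Python. It compares SoftMax with sparsemax, used here only as one well-known example of a sparse SoftMax variant; the abstract does not name a specific variant, and the score vector is hypothetical. This is not the MultiMax function itself, whose definition is not given in the abstract.

```python
import math


def softmax(x):
    """Map a list of scores onto the probability simplex (numerically stable)."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def sparsemax(x):
    """Euclidean projection of the scores onto the probability simplex.

    Illustrative sparse alternative (an assumption of this sketch, not the
    paper's method): entries below a threshold tau receive exactly zero mass.
    """
    z = sorted(x, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, zk in enumerate(z, start=1):
        cumsum += zk
        # The support is the largest k for which this condition holds.
        if 1 + k * zk > cumsum:
            tau = (cumsum - 1) / k
    return [max(v - tau, 0.0) for v in x]


if __name__ == "__main__":
    # Hypothetical scores: two large entries and three small ones.
    scores = [2.0, 1.9, 0.1, 0.0, -0.1]
    dense = softmax(scores)
    sparse = sparsemax(scores)
    # Mass that SoftMax assigns to the three small entries.
    print("softmax:", [round(p, 4) for p in dense])
    print("softmax residual mass:", round(sum(dense[2:]), 4))
    print("sparsemax:", [round(p, 4) for p in sparse])
```

In this sketch SoftMax gives every entry strictly positive probability, so the three small scores still absorb part of the mass, while sparsemax sets them to exactly zero. How such variants behave when the large entries are further apart is the multi-modality question the abstract raises.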
Cite
Text
Zhou et al. "MultiMax: Sparse and Multi-Modal Attention Learning." International Conference on Machine Learning, 2024.
Markdown
[Zhou et al. "MultiMax: Sparse and Multi-Modal Attention Learning." International Conference on Machine Learning, 2024.](https://mlanthology.org/icml/2024/zhou2024icml-multimax/)
BibTeX
@inproceedings{zhou2024icml-multimax,
title = {{MultiMax: Sparse and Multi-Modal Attention Learning}},
author = {Zhou, Yuxuan and Fritz, Mario and Keuper, Margret},
booktitle = {International Conference on Machine Learning},
year = {2024},
pages = {61897--61912},
volume = {235},
url = {https://mlanthology.org/icml/2024/zhou2024icml-multimax/}
}