Attention Is All You Need for Mixture-of-Depths Routing
Abstract
Advancements in deep learning are driven by training models with increasingly larger numbers of parameters, which in turn heightens the computational demands. To address this issue, Mixture-of-Depths (MoD) models have been proposed to dynamically focus computation on the most relevant parts of the input, thereby enabling the deployment of large-parameter models with high efficiency during inference and training. However, conventional MoD models employ additional network layers specifically for routing, which are difficult to train and add complexity to the model. In this paper, we introduce *A-MoD*, a novel attention-based routing mechanism that leverages the existing attention map of the preceding layer to make routing decisions in the current layer. Compared to standard routing, *A-MoD* allows for more efficient training as it introduces no additional trainable parameters and can be easily adapted from pre-trained transformer models. Furthermore, it can increase the performance of the MoD model: for instance, we observe up to 2% higher accuracy on ImageNet compared to standard routing and isoFLOP ViT baselines.
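As a rough picture of the idea, a standard MoD block scores tokens with a small learned router and only the top-scoring fraction is processed by the block; A-MoD instead derives these scores from the attention map already computed in the preceding layer. The snippet below is a minimal sketch of that scheme in PyTorch, not the paper's reference implementation: the function names, the reduction of the attention map over heads and queries, and the fixed capacity fraction are illustrative assumptions.

import torch

def attention_routing_scores(attn_map: torch.Tensor) -> torch.Tensor:
    # attn_map: attention weights from the preceding layer,
    # shape (batch, heads, queries, keys).
    # Assumed reduction: average over heads and queries, so a token's score
    # reflects how much attention it received in that layer.
    return attn_map.mean(dim=(1, 2))  # (batch, tokens)

def route_top_k(tokens: torch.Tensor, scores: torch.Tensor, capacity: float = 0.5):
    # tokens: (batch, seq_len, dim). Only the top `capacity` fraction of
    # tokens is selected for full computation in the current block; the
    # remaining tokens would bypass it via the residual path, as in
    # Mixture-of-Depths.
    k = max(1, int(capacity * tokens.shape[1]))
    top_idx = scores.topk(k, dim=-1).indices                      # (batch, k)
    gather_idx = top_idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
    selected = tokens.gather(1, gather_idx)                       # (batch, k, dim)
    return selected, top_idx

Because the scores come from an attention map the transformer already computes, no extra router parameters are introduced, which is why this routing can be adapted from a pre-trained model without training a new component.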
Cite
Text
Gadhikar et al. "Attention Is All You Need for Mixture-of-Depths Routing." ICLR 2025 Workshops: SCOPE, 2025.
Markdown
[Gadhikar et al. "Attention Is All You Need for Mixture-of-Depths Routing." ICLR 2025 Workshops: SCOPE, 2025.](https://mlanthology.org/iclrw/2025/gadhikar2025iclrw-attention/)
BibTeX
@inproceedings{gadhikar2025iclrw-attention,
  title = {{Attention Is All You Need for Mixture-of-Depths Routing}},
  author = {Gadhikar, Advait and Majumdar, Souptik Kumar and Popp, Niclas and Saranrittichai, Piyapat and Rapp, Martin and Schott, Lukas},
  booktitle = {ICLR 2025 Workshops: SCOPE},
  year = {2025},
  url = {https://mlanthology.org/iclrw/2025/gadhikar2025iclrw-attention/}
}