Differentiable Attention Sparsity via Structured $d$-Gating
Abstract
A core component of modern large language models is the attention mechanism, but its immense parameter count necessitates structured sparsity for resource-efficient optimization and inference. Traditional sparsity penalties, such as the group lasso, are non-smooth and thus incompatible with standard stochastic gradient descent methods. To address this, we propose a deep gating mechanism that reformulates the structured sparsity penalty into a fully differentiable optimization problem, allowing effective and principled norm-based group sparsification without requiring specialized non-smooth optimizers. Our theoretical analysis and empirical results demonstrate that this approach enables structured sparsity with simple stochastic gradient descent or variants while maintaining predictive performance.
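To make the idea concrete, below is a minimal, hypothetical sketch of the gating reparameterization described in the abstract: each attention head's weight block is factored into base weights times scalar gates, and a plain squared-L2 penalty on all factors serves as a smooth surrogate whose minima mimic a non-smooth group penalty on the collapsed weights, so whole heads can shrink to exactly zero under ordinary SGD. The class name `DGatedHeads`, the shapes, and the hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class DGatedHeads(nn.Module):
    """Illustrative d-gating reparameterization for per-head structured sparsity.

    Each head's weight block is stored as  W_h = V_h * prod_k g_{h,k},
    where the g_{h,k} are (d - 1) scalar gates per head. Penalizing the
    squared L2 norm of V and g is fully differentiable, yet at its minima
    it behaves like a non-smooth group norm on the collapsed weights, so
    entire heads can be driven to zero with standard SGD.
    (Hypothetical sketch; names and shapes are assumptions.)
    """

    def __init__(self, num_heads: int, head_dim: int, model_dim: int, d: int = 2):
        super().__init__()
        self.d = d
        # Base weights: one projection block per head.
        self.base = nn.Parameter(torch.randn(num_heads, head_dim, model_dim) * 0.02)
        # (d - 1) scalar gates per head; d = 2 corresponds to a group-lasso-like penalty.
        self.gates = nn.Parameter(torch.ones(d - 1, num_heads))

    def collapsed_weight(self) -> torch.Tensor:
        # Effective per-head weights: W_h = V_h * prod_k g_{h,k}.
        gate_prod = self.gates.prod(dim=0)            # (num_heads,)
        return self.base * gate_prod.view(-1, 1, 1)   # (num_heads, head_dim, model_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, model_dim) -> per-head outputs (batch, num_heads, head_dim)
        return torch.einsum("bm,hdm->bhd", x, self.collapsed_weight())

    def smooth_penalty(self) -> torch.Tensor:
        # Differentiable surrogate: plain squared L2 norm of all factors.
        return self.base.pow(2).sum() + self.gates.pow(2).sum()


if __name__ == "__main__":
    torch.manual_seed(0)
    layer = DGatedHeads(num_heads=4, head_dim=8, model_dim=16, d=2)
    opt = torch.optim.SGD(layer.parameters(), lr=1e-2)
    x, y = torch.randn(32, 16), torch.randn(32, 4, 8)
    for _ in range(200):
        opt.zero_grad()
        loss = (layer(x) - y).pow(2).mean() + 1e-2 * layer.smooth_penalty()
        loss.backward()
        opt.step()
    # Per-head norms of the collapsed weights; unneeded heads shrink toward zero.
    print(layer.collapsed_weight().norm(dim=(1, 2)))
```

A usage note on the design: because the penalty is a simple squared norm of the factors, no proximal or subgradient machinery is needed; the gradient of the surrogate is smooth everywhere, which is the point the abstract makes about compatibility with standard stochastic gradient methods.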
Cite

Text
Kolb et al. "Differentiable Attention Sparsity via Structured $d$-Gating." ICLR 2025 Workshops: SLLM, 2025.

Markdown
[Kolb et al. "Differentiable Attention Sparsity via Structured $d$-Gating." ICLR 2025 Workshops: SLLM, 2025.](https://mlanthology.org/iclrw/2025/kolb2025iclrw-differentiable/)

BibTeX
@inproceedings{kolb2025iclrw-differentiable,
  title = {{Differentiable Attention Sparsity via Structured $d$-Gating}},
  author = {Kolb, Chris and Bischl, Bernd and Rügamer, David},
  booktitle = {ICLR 2025 Workshops: SLLM},
  year = {2025},
  url = {https://mlanthology.org/iclrw/2025/kolb2025iclrw-differentiable/}
}