Pay Attention to Small Weights
Abstract
Fine-tuning large pretrained neural networks is known to be resource-intensive, in terms of both memory and computational cost. To mitigate this, a common approach is to restrict training to a subset of the model parameters. By analyzing the relationship between gradients and weights during fine-tuning, we observe a notable pattern: large gradients are often associated with small-magnitude weights. This correlation is more pronounced in fine-tuning settings than in training from scratch. Motivated by this observation, we propose \textsc{NanoAdam}, which dynamically updates only the small-magnitude weights during fine-tuning and offers several practical advantages: first, the criterion is \emph{gradient-free}, so the parameter subset can be determined without gradient computation; second, it preserves large-magnitude weights, which are likely to encode critical features learned during pre-training, thereby reducing the risk of catastrophic forgetting; third, it permits the use of larger learning rates and consistently leads to better generalization in our experiments. We demonstrate these benefits on both NLP and vision tasks.
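The core idea described above, selecting a gradient-free subset of small-magnitude weights and updating only those entries during fine-tuning, can be illustrated with a short sketch. This is a minimal illustration assuming a PyTorch setup; the helper names (`small_weight_mask`, `masked_step`) and the `keep_fraction` parameter are hypothetical and not taken from the paper, and the select-then-restore logic shown here is only one plausible way to realize the idea, not the authors' implementation of NanoAdam.

```python
import torch


def small_weight_mask(model, keep_fraction=0.3):
    # Hypothetical helper: per parameter tensor, mark the keep_fraction of
    # entries with the smallest magnitude. Note that no gradients are needed,
    # matching the "gradient-free" selection criterion in the abstract.
    masks = {}
    for name, p in model.named_parameters():
        k = max(1, int(keep_fraction * p.numel()))
        threshold = p.detach().abs().flatten().kthvalue(k).values
        masks[name] = p.detach().abs() <= threshold
    return masks


def masked_step(model, optimizer, masks):
    # Snapshot the current parameters, take a normal optimizer step, then
    # restore the large-magnitude entries so that only the selected
    # small-magnitude weights actually change.
    before = {name: p.detach().clone() for name, p in model.named_parameters()}
    optimizer.step()
    with torch.no_grad():
        for name, p in model.named_parameters():
            p.copy_(torch.where(masks[name], p, before[name]))
```

A typical loop would recompute the mask periodically (the paper describes the selection as dynamic), call `loss.backward()`, and then replace `optimizer.step()` with `masked_step(model, optimizer, masks)`; the snapshot-and-restore step is used here so that optimizer momentum cannot move the frozen large-magnitude weights.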
Cite
Text
Zhou et al. "Pay Attention to Small Weights." Advances in Neural Information Processing Systems, 2025.
Markdown
[Zhou et al. "Pay Attention to Small Weights." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/zhou2025neurips-pay/)
BibTeX
@inproceedings{zhou2025neurips-pay,
title = {{Pay Attention to Small Weights}},
author = {Zhou, Chao and Jacobs, Tom and Gadhikar, Advait and Burkholz, Rebekka},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025},
url = {https://mlanthology.org/neurips/2025/zhou2025neurips-pay/}
}