Cautious Weight Decay
Abstract
We introduce Cautious Weight Decay (CWD), a one-line, optimizer-agnostic modification that applies weight decay only to parameter coordinates whose signs align with the optimizer update. Unlike standard decoupled decay, which implicitly optimizes a regularized or constrained objective, CWD preserves the original loss and admits a bilevel interpretation: it induces sliding-mode behavior upon reaching the stationary manifold, allowing it to search for locally Pareto-optimal stationary points of the unmodified objective. In practice, CWD is a drop-in change for optimizers such as AdamW, Lion, and Muon, requiring no new hyperparameters or additional tuning. For language model pre-training and ImageNet classification, CWD consistently improves final loss and accuracy at million- to billion-parameter scales.
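The masking rule described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' reference implementation. It assumes the optimizer produces an update direction `update` that is applied as `params - lr * update`, and that "signs align" means the elementwise product `params * update` is strictly positive. The function names, the strict inequality, and the hyperparameter values are all hypothetical choices made for illustration.

```python
import numpy as np

def standard_decay_step(params, update, lr, weight_decay):
    """Decoupled weight decay: every coordinate is shrunk toward zero."""
    return params - lr * (update + weight_decay * params)

def cautious_decay_step(params, update, lr, weight_decay):
    """Sketch of cautious decay under the assumptions stated above.

    Decay is applied only on coordinates where the parameter and the
    update direction share a sign (assumed convention: params * update > 0).
    """
    mask = (params * update > 0).astype(params.dtype)
    return params - lr * (update + weight_decay * mask * params)

# Toy example: coordinates 0 and 3 have matching signs, 1 and 2 do not.
params = np.array([1.0, -2.0, 3.0, -4.0])
update = np.array([0.5, 0.5, -0.5, -0.5])

cautious = cautious_decay_step(params, update, lr=0.1, weight_decay=0.1)
standard = standard_decay_step(params, update, lr=0.1, weight_decay=0.1)
```

In this toy case the two rules agree on coordinates 0 and 3, where the signs match, and differ on coordinates 1 and 2, where the cautious variant skips the decay term. For a real optimizer such as AdamW, Lion, or Muon, `update` would be that optimizer's own direction; the sketch changes only the decay term.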
Cite
Text
Chen et al. "Cautious Weight Decay." International Conference on Learning Representations, 2026.
Markdown
[Chen et al. "Cautious Weight Decay." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/chen2026iclr-cautious/)
BibTeX
@inproceedings{chen2026iclr-cautious,
title = {{Cautious Weight Decay}},
author = {Chen, Lizhang and Li, Jonathan and Liang, Kaizhao and Su, Baiyu and Xie, Cong and Liang, Chen and Lao, Ni and Liu, Qiang},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/chen2026iclr-cautious/}
}