Cautious Weight Decay

Abstract

We introduce Cautious Weight Decay (CWD), a one-line, optimizer-agnostic modification that applies weight decay only to parameter coordinates whose signs align with the optimizer update. Unlike standard decoupled decay, which implicitly optimizes a regularized or constrained objective, CWD preserves the original loss and admits a bilevel interpretation: it induces sliding-mode behavior upon reaching the stationary manifold, allowing it to search for locally Pareto-optimal stationary points of the unmodified objective. In practice, CWD is a drop-in change for optimizers such as AdamW, Lion, and Muon, requiring no new hyperparameters or additional tuning. For language model pre-training and ImageNet classification, CWD consistently improves final loss and accuracy at million- to billion-parameter scales.
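The masking rule described in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation. It assumes a convention the abstract does not spell out: the optimizer update `u` is the quantity subtracted from the parameter, and a coordinate is "aligned" when `x * u > 0` strictly. The function name `cwd_step` and its parameters `lr` and `weight_decay` are hypothetical names chosen for illustration.

```python
def cwd_step(params, updates, lr=0.1, weight_decay=0.1):
    """One illustrative step of sign-masked (cautious) decoupled weight decay.

    Assumptions (not confirmed by the abstract):
      * `updates` holds the optimizer's update direction u, applied as x - lr * u;
      * a coordinate counts as sign-aligned when x * u > 0 (strict inequality).
    """
    new_params = []
    for x, u in zip(params, updates):
        # Decay only the coordinates whose sign matches the update's sign.
        mask = 1.0 if x * u > 0 else 0.0
        new_params.append(x - lr * (u + weight_decay * mask * x))
    return new_params


def standard_decay_step(params, updates, lr=0.1, weight_decay=0.1):
    """Standard decoupled weight decay, for comparison: every coordinate decays."""
    return [x - lr * (u + weight_decay * x) for x, u in zip(params, updates)]


params = [1.0, -2.0, 3.0]
updates = [0.5, 0.5, -1.0]   # only the first coordinate is sign-aligned
cautious = cwd_step(params, updates)
standard = standard_decay_step(params, updates)
```

Under these assumptions, only the first coordinate is decayed by `cwd_step`, while `standard_decay_step` shrinks all three. Relative to standard decoupled decay, the change is the single masking line, and no hyperparameter is added.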

Cite

Text

Chen et al. "Cautious Weight Decay." International Conference on Learning Representations, 2026.

Markdown

[Chen et al. "Cautious Weight Decay." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/chen2026iclr-cautious/)

BibTeX

@inproceedings{chen2026iclr-cautious,
  title     = {{Cautious Weight Decay}},
  author    = {Chen, Lizhang and Li, Jonathan and Liang, Kaizhao and Su, Baiyu and Xie, Cong and Liang, Chen and Lao, Ni and Liu, Qiang},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/chen2026iclr-cautious/}
}