Cautious Optimizers: Improving Training with One Line of Code

Abstract

AdamW has long been the default optimizer for transformer pretraining. For many years, our community has searched for faster and more stable optimizers, with only limited success. In this work, we propose a single-line modification in PyTorch to any momentum-based optimizer, which we call a cautious optimizer, e.g. C-AdamW and C-Lion. Our theoretical results show that this modification preserves Adam's Hamiltonian function and does not break the convergence guarantee under Lyapunov analysis. In addition, our theoretical insight reveals a whole new family of optimizers. Among them, we pick the simplest one for empirical experiments, which show not only consistent speed-ups on LLM pretraining and post-training tasks, but also better results in MAE pretraining, with minimal extra hyperparameter tuning.
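
The abstract does not spell out the modification itself. As a rough illustration only, the sketch below assumes that "cautious" means zeroing the components of the optimizer's proposed update whose sign disagrees with the current gradient, then rescaling the surviving components. The function names `cautious_update` and `apply_step`, the `eps` floor, and the rescaling rule are assumptions made for this example and are not taken from the text above. Plain Python lists stand in for tensors so that the sketch runs without PyTorch.

```python
def cautious_update(update, grad, eps=1e-8):
    """Hypothetical sketch: keep only update components aligned with the gradient.

    `update` is the step direction proposed by a momentum-based optimizer
    (e.g. Adam's m / (sqrt(v) + eps)); `grad` is the current gradient.
    """
    # 1.0 where the update and gradient agree in sign, 0.0 elsewhere.
    mask = [1.0 if u * g > 0 else 0.0 for u, g in zip(update, grad)]
    # Assumed rescaling: divide by the fraction of kept components so the
    # overall step size stays comparable; eps guards against division by zero.
    kept_fraction = sum(mask) / len(mask)
    scale = 1.0 / max(kept_fraction, eps)
    return [u * m * scale for u, m in zip(update, mask)]


def apply_step(params, update, grad, lr=0.1):
    """Apply one masked descent step: p <- p - lr * cautious_update."""
    masked = cautious_update(update, grad)
    return [p - lr * c for p, c in zip(params, masked)]


# Components 1 and 2 disagree in sign with the gradient and are zeroed out.
update = [0.5, -0.2, 0.1, 0.4]
grad = [1.0, 1.0, -3.0, 2.0]
masked = cautious_update(update, grad)
new_params = apply_step([1.0, 1.0, 1.0, 1.0], update, grad)
```

In a tensor library the masking and rescaling would plausibly collapse into a single elementwise expression, which is consistent with the "one line" framing, though the exact form used by the authors is not given here.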

Cite

Text

Liang et al. "Cautious Optimizers: Improving Training with One Line of Code." International Conference on Learning Representations, 2026.

Markdown

[Liang et al. "Cautious Optimizers: Improving Training with One Line of Code." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/liang2026iclr-cautious/)

BibTeX

@inproceedings{liang2026iclr-cautious,
  title     = {{Cautious Optimizers: Improving Training with One Line of Code}},
  author    = {Liang, Kaizhao and Chen, Lizhang and Liu, Bo and Liu, Qiang},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/liang2026iclr-cautious/}
}