DeMo: Decoupled Momentum Optimization

Abstract

Scaling neural network training increasingly depends on synchronous data-parallelism, yet full-precision gradient all-reduce imposes a severe communication bottleneck. We propose Decoupled Momentum Optimization, a drop-in replacement for any momentum-based optimizers that significantly reduces the communication bandwidth while maintaining convergence. DeMo (i) decouples local momentum updates, (ii) applies a fast orthonormal transform (e.g., DCT) followed by top-$k$ sparsification, and (iii) reuses the momentum buffer for error feedback via momentum subtraction. This design reduces per-step communication by up to two orders of magnitude with minimal computational overhead. Experiments on 300M- and 1B-parameter DeMo language models show DeMo transmits up to 85× less data per GPU than AdamW-DDP while achieving comparable loss and accuracy. DeMo is topology-agnostic and enables training across multi-datacenter or Ethernet-based setups.

Cite

Text

Peng et al. "DeMo: Decoupled Momentum Optimization." International Conference on Learning Representations, 2026.

Markdown

[Peng et al. "DeMo: Decoupled Momentum Optimization." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/peng2026iclr-demo/)

BibTeX

@inproceedings{peng2026iclr-demo,
  title     = {{DeMo: Decoupled Momentum Optimization}},
  author    = {Peng, Bowen and Chen, Lizhang and Su, Baiyu and Quesnelle, Jeffrey and Kingma, Diederik P and Liu, Qiang},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/peng2026iclr-demo/}
}