DeMo: Decoupled Momentum Optimization
Abstract
Scaling neural network training increasingly depends on synchronous data-parallelism, yet full-precision gradient all-reduce imposes a severe communication bottleneck. We propose Decoupled Momentum Optimization (DeMo), a drop-in replacement for any momentum-based optimizer that significantly reduces communication bandwidth while maintaining convergence. DeMo (i) decouples local momentum updates, (ii) applies a fast orthonormal transform (e.g., DCT) followed by top-$k$ sparsification, and (iii) reuses the momentum buffer for error feedback via momentum subtraction. This design reduces per-step communication by up to two orders of magnitude with minimal computational overhead. Experiments on 300M- and 1B-parameter DeMo language models show DeMo transmits up to 85× less data per GPU than AdamW-DDP while achieving comparable loss and accuracy. DeMo is topology-agnostic and enables training across multi-datacenter or Ethernet-based setups.
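The three steps named in the abstract can be illustrated with a minimal single-worker sketch in NumPy. This is an assumption-laden toy, not the authors' implementation: the function names `dct_matrix` and `demo_like_step`, the decay `beta`, the flat 1-D momentum vector, and the explicit dense DCT matrix (rather than a fast transform) are all hypothetical choices made for readability. It omits cross-worker aggregation of the selected coefficients and the final parameter update.

```python
import numpy as np


def dct_matrix(n):
    """Return the n x n orthonormal DCT-II matrix D, so that D @ D.T == I."""
    k = np.arange(n)[:, None]  # frequency index (rows)
    i = np.arange(n)[None, :]  # sample index (columns)
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    d[0, :] /= np.sqrt(2.0)    # rescale the DC row to make D orthonormal
    return d


def demo_like_step(momentum, grad, dct, beta=0.9, k=4):
    """One local step of the sketch (hypothetical helper, not the paper's API).

    Returns the residual momentum kept on the worker, plus the indices and
    values of the k coefficients that would be sent over the network.
    """
    # (i) Momentum is accumulated locally and is never all-reduced in full.
    momentum = beta * momentum + grad

    # (ii) Orthonormal transform, then keep the k largest-magnitude coefficients.
    coeffs = dct @ momentum
    idx = np.argsort(np.abs(coeffs))[-k:]
    vals = coeffs[idx]

    # (iii) Error feedback via momentum subtraction: remove what was
    # transmitted, so the untransmitted remainder stays in the buffer.
    sparse = np.zeros_like(coeffs)
    sparse[idx] = vals
    momentum = momentum - dct.T @ sparse

    return momentum, idx, vals
```

In this sketch only `idx` and `vals` (2k numbers) would cross the network per step instead of all n momentum entries, which is where a bandwidth reduction would come from; the ratio actually achieved depends on choices the abstract does not specify.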
Cite
Text
Peng et al. "DeMo: Decoupled Momentum Optimization." International Conference on Learning Representations, 2026.
Markdown
[Peng et al. "DeMo: Decoupled Momentum Optimization." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/peng2026iclr-demo/)
BibTeX
@inproceedings{peng2026iclr-demo,
title = {{DeMo: Decoupled Momentum Optimization}},
author = {Peng, Bowen and Chen, Lizhang and Su, Baiyu and Quesnelle, Jeffrey and Kingma, Diederik P and Liu, Qiang},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/peng2026iclr-demo/}
}