Learning to Optimize with Recurrent Hierarchical Transformers

Abstract

Learning to optimize (L2O) has received a lot of attention recently because of its potential to leverage data to outperform hand-designed optimization algorithms such as Adam. However, learned optimizers can suffer from high meta-training costs and memory overhead. Recent attempts have been made to reduce the computational costs of these learned optimizers by introducing a hierarchy that enables them to perform most of the heavy computation at the tensor (layer) level rather than the parameter level. This not only leads to sublinear memory cost with respect to the number of parameters, but also allows for higher representational capacity for efficient learned optimization. To this end, we propose an efficient transformer-based learned optimizer that facilitates communication among tensors with self-attention and keeps track of optimization history with recurrence. We show that our optimizer converges faster than strong baselines at a comparable memory overhead, thereby suggesting encouraging scaling trends.
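
To make the hierarchical design concrete, below is a minimal, hypothetical sketch (in PyTorch, not the authors' implementation) of how tensor-level recurrence and cross-tensor self-attention could be combined in a learned optimizer: per-parameter gradient features are pooled into one embedding per tensor, a recurrent cell carries each tensor's optimization history across steps, self-attention lets tensor-level states communicate, and a small MLP maps the result back to per-parameter updates. All module names, feature choices, and sizes are illustrative assumptions.

# Conceptual sketch of a hierarchical learned optimizer (illustrative only).
import torch
import torch.nn as nn


class HierarchicalLearnedOptimizer(nn.Module):
    def __init__(self, feat_dim: int = 2, hidden_dim: int = 32, num_heads: int = 4):
        super().__init__()
        self.encode = nn.Linear(feat_dim, hidden_dim)            # per-parameter features -> embedding
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)            # per-tensor recurrent history
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.decode = nn.Sequential(                             # per-parameter update head
            nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1)
        )

    def init_state(self, params):
        # One hidden state per tensor (layer), so the recurrent memory grows with
        # the number of tensors rather than the number of parameters.
        return [torch.zeros(1, self.rnn.hidden_size) for _ in params]

    def step(self, params, grads, state, lr: float = 1e-3):
        new_state, tensor_embs, per_param = [], [], []
        for p, g, h in zip(params, grads, state):
            # Simple per-parameter features: gradient and current value (assumed choice).
            feats = torch.stack([g.flatten(), p.detach().flatten()], dim=-1)
            enc = self.encode(feats)                         # (n_params_in_tensor, hidden)
            h_new = self.rnn(enc.mean(0, keepdim=True), h)   # tensor-level recurrence
            new_state.append(h_new)
            tensor_embs.append(h_new)
            per_param.append(enc)
        # Self-attention across tensors: each tensor's state attends to all others.
        seq = torch.cat(tensor_embs, dim=0).unsqueeze(0)     # (1, n_tensors, hidden)
        mixed, _ = self.attn(seq, seq, seq)
        updates = []
        for i, (p, enc) in enumerate(zip(params, per_param)):
            ctx = mixed[0, i].expand(enc.shape[0], -1)       # broadcast tensor context
            delta = self.decode(torch.cat([enc, ctx], dim=-1)).view_as(p)
            updates.append(p - lr * delta)                   # proposed parameter update
        return updates, new_state

In practice such an optimizer would itself be meta-trained, e.g., by unrolling inner-loop updates on target tasks and backpropagating a meta-loss through them; the sketch above only covers a single inner update step.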

Cite

Text

Moudgil et al. "Learning to Optimize with Recurrent Hierarchical Transformers." ICML 2023 Workshops: Frontiers4LCD, 2023.

Markdown

[Moudgil et al. "Learning to Optimize with Recurrent Hierarchical Transformers." ICML 2023 Workshops: Frontiers4LCD, 2023.](https://mlanthology.org/icmlw/2023/moudgil2023icmlw-learning/)

BibTeX

@inproceedings{moudgil2023icmlw-learning,
  title     = {{Learning to Optimize with Recurrent Hierarchical Transformers}},
  author    = {Moudgil, Abhinav and Knyazev, Boris and Lajoie, Guillaume and Belilovsky, Eugene},
  booktitle = {ICML 2023 Workshops: Frontiers4LCD},
  year      = {2023},
  url       = {https://mlanthology.org/icmlw/2023/moudgil2023icmlw-learning/}
}