Loss Landscape Geometry Reveals Stagewise Development of Transformers

Abstract

The development of the internal structure of neural networks throughout training occurs in tandem with changes in the local geometry of the population loss. By quantifying the degeneracy of this geometry using the recently proposed Local Learning Coefficient, we show that the training process for a transformer language model can be decomposed into discrete developmental stages. We connect these stages to interpretable shifts in input–output behavior and developments in internal structure. These findings offer new insights into transformer development and underscore the crucial role of loss landscape geometry in understanding the dynamics of deep learning.

Cite

Text

Wang et al. "Loss Landscape Geometry Reveals Stagewise Development of Transformers." ICML 2024 Workshops: HiLD, 2024.

Markdown

[Wang et al. "Loss Landscape Geometry Reveals Stagewise Development of Transformers." ICML 2024 Workshops: HiLD, 2024.](https://mlanthology.org/icmlw/2024/wang2024icmlw-loss/)

BibTeX

@inproceedings{wang2024icmlw-loss,
  title     = {{Loss Landscape Geometry Reveals Stagewise Development of Transformers}},
  author    = {Wang, George and Farrugia-Roberts, Matthew and Hoogland, Jesse and Carroll, Liam and Wei, Susan and Murfet, Daniel},
  booktitle = {ICML 2024 Workshops: HiLD},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/wang2024icmlw-loss/}
}