Delays in Generalization Match Delayed Changes in Representational Geometry

Abstract

Delayed generalization, also known as "grokking", has emerged as a well-replicated phenomenon in overparameterized neural networks. Recent theoretical work has associated grokking with the transition from the lazy to the rich learning regime, measured as the change in the Neural Tangent Kernel (NTK) from its initial state. Here, we present an empirical study on image classification tasks. Surprisingly, we demonstrate that the NTK deviates from its initial state well before the onset of grokking, i.e., before test performance increases, suggesting that rich learning occurs before generalization. To explain this discrepancy, we instead examine the representational geometry of the network and find that grokking coincides in time with a rapid increase in manifold capacity and improvements in effective geometry metrics. Notably, this sharp transition is absent when generalization is not delayed. Our findings on real data show that the lazy and rich training regimes can become decoupled from sudden generalization. In contrast, changes in representational geometry remain tightly linked to generalization and may therefore better explain grokking dynamics.
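A common way to quantify the lazy-to-rich transition the abstract refers to is the kernel distance between the empirical NTK at a given training step and at initialization. The sketch below is a minimal, illustrative implementation in JAX, not the paper's code: the small scalar-output MLP, the helper names (init_mlp, empirical_ntk, kernel_distance), and the particular distance definition (one minus the normalized Frobenius inner product of the two Gram matrices) are all assumptions; the paper's exact metric may differ.

import jax
import jax.numpy as jnp

def init_mlp(key, sizes):
    # He-style random init; returns a list of (W, b) layer parameters.
    params = []
    for din, dout in zip(sizes[:-1], sizes[1:]):
        key, sub = jax.random.split(key)
        params.append((jax.random.normal(sub, (din, dout)) * jnp.sqrt(2.0 / din),
                       jnp.zeros(dout)))
    return params

def mlp(params, x):
    # Scalar-output ReLU MLP.
    for W, b in params[:-1]:
        x = jax.nn.relu(x @ W + b)
    W, b = params[-1]
    return (x @ W + b).squeeze(-1)

def empirical_ntk(params, xs):
    # Empirical NTK Gram matrix: K[i, j] = <grad_theta f(x_i), grad_theta f(x_j)>.
    def flat_grad(x):
        grads = jax.grad(lambda p: mlp(p, x[None]).sum())(params)
        return jnp.concatenate([g.ravel() for g in jax.tree_util.tree_leaves(grads)])
    J = jax.vmap(flat_grad)(xs)   # (n_examples, n_params) Jacobian of outputs w.r.t. params
    return J @ J.T

def kernel_distance(K0, Kt):
    # 1 - <K0, Kt>_F / (||K0||_F ||Kt||_F); zero iff the kernel is unchanged up to scale (lazy regime).
    return 1.0 - jnp.sum(K0 * Kt) / (jnp.linalg.norm(K0) * jnp.linalg.norm(Kt))

params0 = init_mlp(jax.random.PRNGKey(0), [784, 128, 128, 1])
xs = jax.random.normal(jax.random.PRNGKey(1), (32, 784))   # fixed probe batch
K0 = empirical_ntk(params0, xs)
# After training produces params_t:
# print(kernel_distance(K0, empirical_ntk(params_t, xs)))

Tracking this distance on a fixed probe batch over training would show whether the NTK moves before test accuracy rises, which is the comparison the abstract describes.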

Cite

Text

Zheng et al. "Delays in Generalization Match Delayed Changes in Representational Geometry." NeurIPS 2024 Workshops: UniReps, 2024.

Markdown

[Zheng et al. "Delays in Generalization Match Delayed Changes in Representational Geometry." NeurIPS 2024 Workshops: UniReps, 2024.](https://mlanthology.org/neuripsw/2024/zheng2024neuripsw-delays/)

BibTeX

@inproceedings{zheng2024neuripsw-delays,
  title     = {{Delays in Generalization Match Delayed Changes in Representational Geometry}},
  author    = {Zheng, Xingyu and Daruwalla, Kyle and Benjamin, Ari S and Klindt, David},
  booktitle = {NeurIPS 2024 Workshops: UniReps},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/zheng2024neuripsw-delays/}
}