Escaping Random Teacher Initialization Enhances Signal Propagation and Representation
Abstract
Recent work shows that, by mimicking a random teacher network, student networks learn to produce better feature representations, even when they are initialized at the teacher itself. In this paper, we characterize how students escape this global optimum of the mimicking loss and investigate how this process translates into concrete properties of the learned representations. To that end, we first describe a simplified setup and identify very large step sizes as the main driver of this phenomenon. We then investigate key signal-propagation and representation-separability properties during the escape. Our analysis reveals a two-stage process: the network first undergoes a form of representational collapse, then steers toward a parameter region that not only allows for better propagation of input signals but also gives rise to well-conditioned representations. We discuss how this behavior may relate to the edge of stability and to label-independent dynamics.
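The setup described in the abstract can be summarized in a few lines of code: a frozen, randomly initialized teacher, a student copied from the teacher's weights, and gradient steps with a very large step size on a label-free mimicking loss. The sketch below is illustrative only; the architecture, loss function, step size, and helper names (make_net, distill_step) are assumptions and not taken from the paper.

```python
import copy
import torch
import torch.nn as nn

def make_net():
    # Hypothetical small network; any randomly initialized model would do.
    return nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 64))

teacher = make_net()                      # random, frozen teacher
student = copy.deepcopy(teacher)          # student initialized *at* the teacher
for p in teacher.parameters():
    p.requires_grad_(False)

# Very large step size: per the abstract, the main driver of escaping
# the global optimum at initialization. The value 5.0 is illustrative.
opt = torch.optim.SGD(student.parameters(), lr=5.0)
loss_fn = nn.MSELoss()                    # placeholder mimicking loss

def distill_step(x):
    # The student mimics the random teacher's outputs; no labels are used,
    # so the training dynamics are label-independent.
    with torch.no_grad():
        target = teacher(x)
    loss = loss_fn(student(x), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Example usage with random inputs standing in for a data loader:
for _ in range(100):
    x = torch.randn(128, 1, 28, 28)
    distill_step(x)
```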
Cite
Text
Sarnthein et al. "Escaping Random Teacher Initialization Enhances Signal Propagation and Representation." NeurIPS 2023 Workshops: M3L, 2023.
Markdown
[Sarnthein et al. "Escaping Random Teacher Initialization Enhances Signal Propagation and Representation." NeurIPS 2023 Workshops: M3L, 2023.](https://mlanthology.org/neuripsw/2023/sarnthein2023neuripsw-escaping/)
BibTeX
@inproceedings{sarnthein2023neuripsw-escaping,
title = {{Escaping Random Teacher Initialization Enhances Signal Propagation and Representation}},
author = {Sarnthein, Felix and Singh, Sidak Pal and Orvieto, Antonio and Hofmann, Thomas},
booktitle = {NeurIPS 2023 Workshops: M3L},
year = {2023},
url = {https://mlanthology.org/neuripsw/2023/sarnthein2023neuripsw-escaping/}
}