Evolution of Concepts in Language Model Pre-Training

Abstract

Language models obtain extensive capabilities through pre-training. However, the pre-training dynamics remain a black box. In this work, we track the evolution of linear, interpretable features across pre-training snapshots using a sparse dictionary learning method called crosscoders. We find that most features begin to form around a specific point in training, while more complex patterns emerge in later training stages. Feature attribution analyses reveal causal connections between feature evolution and downstream performance. Our feature-level observations are highly consistent with previous findings on the Transformer's two-stage learning process, which we term a statistical learning phase and a feature learning phase. Our work opens up the possibility of tracking fine-grained representation progress over the course of language model training. Our code is available at https://github.com/OpenMOSS/Language-Model-SAEs.

Cite

Text

Ge et al. "Evolution of Concepts in Language Model Pre-Training." International Conference on Learning Representations, 2026.

Markdown

[Ge et al. "Evolution of Concepts in Language Model Pre-Training." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/ge2026iclr-evolution/)

BibTeX

@inproceedings{ge2026iclr-evolution,
  title     = {{Evolution of Concepts in Language Model Pre-Training}},
  author    = {Ge, Xuyang and Shu, Wentao and Wu, Jiaxing and Zhou, Yunhua and He, Zhengfu and Qiu, Xipeng},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/ge2026iclr-evolution/}
}