Learning to Grok: Emergence of In-Context Learning and Skill Composition in Modular Arithmetic Tasks
Abstract
Large language models can solve tasks that were not present in the training set. This capability is believed to be due to in-context learning and skill composition. In this work, we study the emergence of in-context learning and skill composition in a collection of modular arithmetic tasks. Specifically, we consider a finite collection of linear modular functions $z = a x + b y \text{ mod } p$ labeled by the vector $(a, b) \in \mathbb{Z}_p^2$. We use some of these tasks for pre-training and the rest for out-of-distribution testing. We empirically show that a GPT-style transformer exhibits a transition from in-distribution to out-of-distribution generalization as the number of pre-training tasks increases. We find that the smallest model capable of out-of-distribution generalization requires two transformer blocks, while for deeper models, the out-of-distribution generalization phase is transient, necessitating early stopping. Finally, we perform an interpretability study of the pre-trained models, revealing highly structured representations in both phases, and we discuss the learnt algorithm.
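To make the task setup concrete, below is a minimal sketch (not the authors' code) of how the modular arithmetic tasks and their pre-training/out-of-distribution split could be generated. The modulus p, the number of pre-training tasks, and the flattened (x, y, z) sequence format are illustrative assumptions rather than the paper's exact choices.

import numpy as np

# Each task is a linear modular map z = (a*x + b*y) mod p, labeled by (a, b) in Z_p^2.
# A subset of tasks is used for pre-training; held-out tasks probe OOD generalization.
p = 29                      # prime modulus (illustrative value)
n_pretrain = 256            # number of pre-training tasks; the paper sweeps this quantity
rng = np.random.default_rng(0)

# Enumerate all p^2 tasks and split them into pre-training and OOD test sets.
all_tasks = [(a, b) for a in range(p) for b in range(p)]
perm = rng.permutation(len(all_tasks))
pretrain_tasks = [all_tasks[i] for i in perm[:n_pretrain]]
ood_tasks = [all_tasks[i] for i in perm[n_pretrain:]]

def build_sequence(task, n_examples=32):
    # Build one in-context sequence of (x, y, z) triples for a single task.
    # The task label (a, b) is never shown; the model must infer it from the examples.
    a, b = task
    x = rng.integers(0, p, size=n_examples)
    y = rng.integers(0, p, size=n_examples)
    z = (a * x + b * y) % p
    return np.stack([x, y, z], axis=1).reshape(-1)  # flatten to a token stream

# Example: one pre-training sequence and one OOD-test sequence.
train_seq = build_sequence(pretrain_tasks[0])
ood_seq = build_sequence(ood_tasks[0])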
Cite
Text
He et al. "Learning to Grok: Emergence of In-Context Learning and Skill Composition in Modular Arithmetic Tasks." ICML 2024 Workshops: MI, 2024.Markdown
[He et al. "Learning to Grok: Emergence of In-Context Learning and Skill Composition in Modular Arithmetic Tasks." ICML 2024 Workshops: MI, 2024.](https://mlanthology.org/icmlw/2024/he2024icmlw-learning/)BibTeX
@inproceedings{he2024icmlw-learning,
title = {{Learning to Grok: Emergence of In-Context Learning and Skill Composition in Modular Arithmetic Tasks}},
author = {He, Tianyu and Doshi, Darshil and Das, Aritra and Gromov, Andrey},
booktitle = {ICML 2024 Workshops: MI},
year = {2024},
url = {https://mlanthology.org/icmlw/2024/he2024icmlw-learning/}
}