What Algorithms Can Transformers Learn? A Study in Length Generalization
Abstract
Large language models exhibit surprising emergent generalization properties, yet also struggle on many simple reasoning tasks such as arithmetic and parity. In this work, we focus on length generalization, and we propose a unifying framework to understand when and how Transformers can be expected to length generalize on a given task. First, we show that there exist algorithmic tasks for which standard decoder-only Transformers trained from scratch naturally exhibit strong length generalization. For these tasks, we leverage the RASP programming language (Weiss et al., 2021) to show that the algorithmic solution that solves the task can be represented by a simple Transformer. We thus propose the RASP-Generalization Conjecture: Transformers tend to learn a length-generalizing solution if there exists a short RASP-L program that works for all input lengths. We present empirical evidence to support the correlation between RASP-simplicity and generalization. We leverage our insights to give new scratchpad formats that yield strong length generalization on traditionally hard tasks (such as parity and addition), and we illustrate how a scratchpad can hinder generalization when it increases the complexity of the corresponding RASP-L program. Overall, our work provides a novel perspective on the mechanisms of length generalization and the algorithmic capabilities of Transformers.
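As a rough illustration of the kind of program the conjecture refers to, below is a minimal numpy sketch of a RASP-style select/aggregate pair implementing an induction-head copying rule. The function names (select, aggregate, induction_copy) and the causal hard-attention semantics here are illustrative assumptions, not the paper's actual RASP-L library; they only sketch how an attention-like primitive can express a length-independent algorithm.

import numpy as np

def select(keys, queries, pred):
    # Attention-style selector: A[i, j] = pred(keys[j], queries[i]),
    # restricted to causal positions j <= i (as in a decoder-only model).
    n = len(keys)
    A = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1):
            A[i, j] = pred(keys[j], queries[i])
    return A

def aggregate(A, values, default=0):
    # For each query position, return the value at the LAST matching
    # key position (a simple hard-attention aggregation rule).
    out = np.full(len(values), default)
    for i, row in enumerate(A):
        idx = np.flatnonzero(row)
        if len(idx) > 0:
            out[i] = values[idx[-1]]
    return out

def induction_copy(tokens):
    # Induction-head-style next-token rule: at each position, find the
    # most recent earlier occurrence of the current token and output
    # the token that followed it.
    prev = np.concatenate(([0], tokens[:-1]))  # shift right by one
    A = select(keys=prev, queries=tokens, pred=lambda k, q: k == q)
    return aggregate(A, tokens)

For a repeating sequence such as [5, 2, 7, 5], the rule at the final position attends back to the earlier occurrence of 5 and outputs 2. Nothing in the program depends on the sequence length, which is the property the conjecture ties to length generalization.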
Cite
Text
Zhou et al. "What Algorithms Can Transformers Learn? A Study in Length Generalization." International Conference on Learning Representations, 2024.
Markdown
[Zhou et al. "What Algorithms Can Transformers Learn? A Study in Length Generalization." International Conference on Learning Representations, 2024.](https://mlanthology.org/iclr/2024/zhou2024iclr-algorithms/)
BibTeX
@inproceedings{zhou2024iclr-algorithms,
title = {{What Algorithms Can Transformers Learn? A Study in Length Generalization}},
author = {Zhou, Hattie and Bradley, Arwen and Littwin, Etai and Razin, Noam and Saremi, Omid and Susskind, Joshua M. and Bengio, Samy and Nakkiran, Preetum},
booktitle = {International Conference on Learning Representations},
year = {2024},
url = {https://mlanthology.org/iclr/2024/zhou2024iclr-algorithms/}
}