Generalization vs. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data
Abstract
Despite the proven utility of large language models (LLMs) in real-world applications, little is understood about how they leverage their large-scale pretraining text corpora to achieve such capabilities. In this work, we investigate the interplay between generalization and memorization in pretrained LLMs at scale, through a comprehensive n-gram analysis of their training data. Our experiments focus on three general task types: translation, question answering, and multiple-choice reasoning. Across open-source LLMs of various sizes and their pretraining corpora, we observe that as model size increases, the task-relevant n-gram pair data becomes increasingly important, leading to improved task performance, decreased memorization, stronger generalization, and emergent abilities. Our results support the hypothesis that LLMs' capabilities emerge from a delicate balance of memorization and generalization given sufficient task-related pretraining data, and they point the way to larger-scale analyses that could further improve our understanding of these models.
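To give a rough sense of the kind of n-gram pair analysis the abstract refers to, the sketch below counts how often an n-gram from a task input co-occurs in a pretraining document with an n-gram from the corresponding target output. This is a minimal illustration, not the authors' implementation; the whitespace tokenization, the choice of n=3, and the pair-counting criterion are all assumptions made for this example.

# Illustrative sketch only (not the paper's code): count "task-relevant
# n-gram pairs", i.e. pretraining documents that contain an n-gram from a
# task input together with an n-gram from its target output.
from itertools import product

def ngrams(text, n=3):
    # Whitespace tokenization and n=3 are simplifying assumptions.
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def count_ngram_pairs(task_input, task_output, corpus_docs, n=3):
    """Number of (input n-gram, output n-gram) pairs that co-occur in at
    least one pretraining document."""
    in_grams, out_grams = ngrams(task_input, n), ngrams(task_output, n)
    doc_grams = [ngrams(doc, n) for doc in corpus_docs]
    return sum(
        1
        for g_in, g_out in product(in_grams, out_grams)
        if any(g_in in d and g_out in d for d in doc_grams)
    )

Aggregating counts of this kind over a task's evaluation examples and a model's full pretraining corpus is one way to quantify the "task-relevant n-gram pair data" the abstract relates to task performance and memorization.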
Cite
Text
Antoniades et al. "Generalization vs. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data." ICML 2024 Workshops: FM-Wild, 2024.
Markdown
[Antoniades et al. "Generalization vs. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data." ICML 2024 Workshops: FM-Wild, 2024.](https://mlanthology.org/icmlw/2024/antoniades2024icmlw-generalization/)
BibTeX
@inproceedings{antoniades2024icmlw-generalization,
title = {{Generalization vs. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data}},
author = {Antoniades, Antonis and Wang, Xinyi and Elazar, Yanai and Amayuelas, Alfonso and Albalak, Alon and Zhang, Kexun and Wang, William Yang},
booktitle = {ICML 2024 Workshops: FM-Wild},
year = {2024},
url = {https://mlanthology.org/icmlw/2024/antoniades2024icmlw-generalization/}
}