In-Context Learning from Training on Unstructured Data: The Role of Co-Occurrence, Positional Information, and Training Data Structure

Abstract

Large language models (LLMs) built on transformers show impressive in-context learning (ICL) capabilities: they can generate predictions for new queries from input-output sequences in the prompt, without parameter updates. While many theories attempt to explain ICL, they often focus on structured training data that resembles ICL tasks, such as regression. In practice, however, these models are trained in an unsupervised manner on unstructured text data, which bears little resemblance to ICL tasks. We therefore investigate how ICL arises from unsupervised training on unstructured data. The key observation is that ICL can arise simply from modeling co-occurrence information with classical language models such as continuous bag of words (CBOW), which we prove and empirically validate. Furthermore, we establish that positional information and nuisance token structure are necessary for ICL to generalize to unseen data. Lastly, we present cases where ICL fails and offer theoretical explanations, indicating that the ICL ability of LLMs can be sensitive to the structure of the training data.
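The co-occurrence observation can be illustrated with a toy sketch. This is not the paper's CBOW model or its experimental setup: as a simplifying assumption, trained CBOW embeddings are swapped for raw sentence-level co-occurrence counts, and the corpus, the filler words, and the names `cooccurrence_counts` and `predict_next` are all hypothetical. It only shows how summing co-occurrence scores over a prompt of country-capital pairs can pick out the capital of the final country.

```python
from collections import Counter
from itertools import combinations

# Hypothetical unstructured corpus: each sentence mentions two
# country-capital pairs plus one filler (nuisance) word.
CORPUS = [
    ["france", "paris", "japan", "tokyo", "the"],
    ["japan", "tokyo", "peru", "lima", "of"],
    ["peru", "lima", "france", "paris", "a"],
    ["kenya", "nairobi", "france", "paris", "in"],
    ["kenya", "nairobi", "japan", "tokyo", "is"],
    ["kenya", "nairobi", "peru", "lima", "and"],
]


def cooccurrence_counts(sentences):
    """Count how often each unordered pair of distinct tokens shares a sentence."""
    counts = Counter()
    for sentence in sentences:
        for a, b in combinations(sorted(set(sentence)), 2):
            counts[(a, b)] += 1
    return counts


def predict_next(prompt, counts, vocab):
    """Return the vocabulary token with the largest summed co-occurrence
    count against all prompt tokens (a count-based stand-in for scoring a
    center word against summed context embeddings)."""
    def score(candidate):
        return sum(counts[tuple(sorted((candidate, ctx)))] for ctx in prompt)

    # Iterating over sorted(vocab) makes tie-breaking deterministic.
    return max(sorted(vocab), key=score)


COUNTS = cooccurrence_counts(CORPUS)
VOCAB = {token for sentence in CORPUS for token in sentence}

# ICL-style prompt: two demonstration pairs followed by a query country.
prompt = ["france", "paris", "japan", "tokyo", "kenya"]
print(predict_next(prompt, COUNTS, VOCAB))  # nairobi
```

In this toy corpus "nairobi" wins because it co-occurs with "kenya" in three sentences, while every other token shares fewer sentences with the prompt overall. The sketch ignores word order entirely, so it says nothing about the roles of positional information or nuisance token structure that the abstract also discusses.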

Cite

Text

Wibisono and Wang. "In-Context Learning from Training on Unstructured Data: The Role of Co-Occurrence, Positional Information, and Training Data Structure." ICML 2024 Workshops: TF2M, 2024.

BibTeX

@inproceedings{wibisono2024icmlw-incontext-a,
  title     = {{In-Context Learning from Training on Unstructured Data: The Role of Co-Occurrence, Positional Information, and Training Data Structure}},
  author    = {Wibisono, Kevin Christian and Wang, Yixin},
  booktitle = {ICML 2024 Workshops: TF2M},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/wibisono2024icmlw-incontext-a/}
}