Testing Memory Capabilities in Large Language Models with the Sequence Order Recall Task

Abstract

Many benchmarks focus on evaluating Large Language Models (LLMs) on facts and semantic relations, primarily assessing their semantic memory. However, some memories in language are linked to their contexts, like time and place, akin to human episodic memory. To address this gap in evaluating memory in LLMs, we introduce the Sequence Order Recall Task (SORT). SORT requires LLMs to recall the correct order of text segments from a text excerpt. We present an initial evaluation dataset, Book-SORT, comprising 36,000 samples extracted from 9 books recently added to the public domain. When the text is given to models in-context, we find that instruction-tuned LLMs can perform this task. However, when models must rely on memory stored in their weights, i.e., when they are not presented with the text excerpts, their accuracy drops below 60%, near or at chance level. We hope that SORT will drive the development of memory-augmented LLMs.

Cite

Text

Pink et al. "Testing Memory Capabilities in Large Language Models with the Sequence Order Recall Task." NeurIPS 2024 Workshops: LXAI, 2024.

Markdown

[Pink et al. "Testing Memory Capabilities in Large Language Models with the Sequence Order Recall Task." NeurIPS 2024 Workshops: LXAI, 2024.](https://mlanthology.org/neuripsw/2024/pink2024neuripsw-testing/)

BibTeX

@inproceedings{pink2024neuripsw-testing,
  title     = {{Testing Memory Capabilities in Large Language Models with the Sequence Order Recall Task}},
  author    = {Pink, Mathis and Vo, Vy A. and Wu, Qinyuan and Mu, Jianing and Turek, Javier S. and Hasson, Uri and Norman, Kenneth A. and Michelmann, Sebastian and Huth, Alexander and Toneva, Mariya},
  booktitle = {NeurIPS 2024 Workshops: LXAI},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/pink2024neuripsw-testing/}
}