Stress-Testing Long-Context Language Models with Lifelong ICL and Task Haystack

Abstract

We introduce Lifelong ICL, a problem setting that challenges long-context language models (LMs) to learn from a sequence of tasks through in-context learning (ICL). We further introduce Task Haystack, an evaluation suite designed for assessing and diagnosing how long-context LMs utilize long contexts in the Lifelong ICL setting. When given a task instruction and test inputs, long-context LMs are expected to leverage the same-task demonstrations in the Lifelong ICL prompt, avoid distraction from other tasks, and achieve a test accuracy no worse than the single-task ICL baseline. Task Haystack draws inspiration from the widely-adopted "needle-in-a-haystack" (NIAH) evaluation, but presents new and unique challenges. It demands that models (1) utilize the context with deeper understanding, rather than resorting to simple copying and pasting; and (2) navigate through long streams of evolving topics and tasks, closely approximating the complexities of real-world scenarios faced by long-context LMs. Additionally, Task Haystack inherits the controllability aspect of NIAH, providing model developers with tools to identify model vulnerabilities effectively. We benchmark ten long-context LMs using Task Haystack. We find that state-of-the-art closed models such as GPT-4o still struggle in this setting, failing 15% of the cases on average, while all open models we evaluate lag behind by a large margin. Further, we design controlled analyses and find that current long-context models are prone to distractibility and recency bias, as well as other limitations in robustness and instruction understanding.
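To make the setting concrete, the sketch below shows one plausible way a Lifelong ICL prompt could be assembled: demonstrations for a stream of tasks are concatenated into a single long context (the "haystack"), and the model is then queried on one target task (the "needle"). All function names, prompt templates, and the toy task stream here are illustrative assumptions, not the paper's actual implementation.

```python
def format_task_block(task_name, demos):
    """Render one task's few-shot demonstrations as a text block."""
    lines = [f"Task: {task_name}"]
    for x, y in demos:
        lines.append(f"Input: {x}\nOutput: {y}")
    return "\n".join(lines)


def build_lifelong_icl_prompt(task_stream, target_task, test_input):
    """Concatenate demos for every task in the stream (the haystack),
    then append the target task's instruction and the test input."""
    haystack = "\n\n".join(
        format_task_block(name, demos) for name, demos in task_stream
    )
    return f"{haystack}\n\nTask: {target_task}\nInput: {test_input}\nOutput:"


# Toy stream of three tasks; only the "sentiment" demonstrations are
# relevant to the final query, and the model must not be distracted
# by the surrounding tasks.
stream = [
    ("topic", [("stocks fell today", "finance")]),
    ("sentiment", [("I loved this movie", "positive")]),
    ("language_id", [("bonjour le monde", "french")]),
]
prompt = build_lifelong_icl_prompt(stream, "sentiment", "what a waste of time")
```

Under this framing, the evaluation compares the model's accuracy on such prompts against a single-task ICL baseline built from only the target task's demonstrations.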

Cite

Text

Xu et al. "Stress-Testing Long-Context Language Models with Lifelong ICL and Task Haystack." ICML 2024 Workshops: LCFM, 2024.

Markdown

[Xu et al. "Stress-Testing Long-Context Language Models with Lifelong ICL and Task Haystack." ICML 2024 Workshops: LCFM, 2024.](https://mlanthology.org/icmlw/2024/xu2024icmlw-stresstesting/)

BibTeX

@inproceedings{xu2024icmlw-stresstesting,
  title     = {{Stress-Testing Long-Context Language Models with Lifelong ICL and Task Haystack}},
  author    = {Xu, Xiaoyue and Ye, Qinyuan and Ren, Xiang},
  booktitle = {ICML 2024 Workshops: LCFM},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/xu2024icmlw-stresstesting/}
}