StreamBench: Towards Benchmarking Continuous Improvement of Language Agents

Abstract

Recent works have shown that large language model (LLM) agents are able to improve themselves from experience, which is an important ability for continuous enhancement post-deployment. However, existing benchmarks primarily evaluate their innate capabilities and do not assess their ability to improve over time. To address this gap, we introduce StreamBench, a pioneering benchmark designed to evaluate the continuous improvement of LLM agents over an input-feedback sequence. StreamBench simulates an online learning environment where LLMs receive a continuous stream of feedback and iteratively enhance their performance. In addition, we propose several simple yet effective baselines for improving LLMs on StreamBench, and provide a comprehensive analysis to identify the critical components that contribute to successful streaming strategies. Our work serves as a stepping stone towards developing effective online learning strategies for LLMs, paving the way for more adaptive AI systems in streaming scenarios.
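
To make the streaming setup concrete, the sketch below shows one possible input-feedback loop of the kind the abstract describes: the agent predicts, receives feedback on its output, stores that feedback, and is scored cumulatively over the sequence. The `StreamingAgent` class, the binary feedback signal, and the simple memory list are illustrative assumptions for this sketch, not the benchmark's actual interface or the authors' baseline methods.

```python
# Minimal sketch of a streaming evaluation loop in the spirit of StreamBench.
# All names and interfaces here are illustrative assumptions, not the
# benchmark's real API.

from dataclasses import dataclass, field


@dataclass
class StreamingAgent:
    """Toy agent that accumulates past (input, prediction, feedback) records."""
    memory: list = field(default_factory=list)

    def predict(self, x: str) -> str:
        # A real agent would prompt an LLM, conditioning on self.memory
        # (e.g., retrieved past examples). Here we return a placeholder.
        return f"answer to: {x}"

    def update(self, x: str, y_pred: str, feedback: bool) -> None:
        # Store the outcome so later predictions can draw on this experience.
        self.memory.append((x, y_pred, feedback))


def run_stream(agent: StreamingAgent, stream) -> None:
    """Iterate over an input-feedback sequence and report running accuracy."""
    correct = 0
    for t, (x, y_true) in enumerate(stream, start=1):
        y_pred = agent.predict(x)
        feedback = (y_pred == y_true)  # binary correctness feedback
        agent.update(x, y_pred, feedback)
        correct += int(feedback)
        print(f"step {t}: running accuracy = {correct / t:.2f}")


if __name__ == "__main__":
    toy_stream = [("2 + 2 = ?", "4"), ("capital of France?", "Paris")]
    run_stream(StreamingAgent(), toy_stream)
```

The key design point this loop illustrates is that evaluation and improvement are interleaved: performance is measured over the whole sequence, so an agent that uses earlier feedback well should score higher on later inputs.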

Cite

Text

Wu et al. "StreamBench: Towards Benchmarking Continuous Improvement of Language Agents." Neural Information Processing Systems, 2024. doi:10.52202/079017-3398

Markdown

[Wu et al. "StreamBench: Towards Benchmarking Continuous Improvement of Language Agents." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/wu2024neurips-streambench/) doi:10.52202/079017-3398

BibTeX

@inproceedings{wu2024neurips-streambench,
  title     = {{StreamBench: Towards Benchmarking Continuous Improvement of Language Agents}},
  author    = {Wu, Cheng-Kuang and Tam, Zhi Rui and Lin, Chieh-Yen and Chen, Yun-Nung and Lee, Hung-yi},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-3398},
  url       = {https://mlanthology.org/neurips/2024/wu2024neurips-streambench/}
}