MISR: Measuring Instrumental Self-Reasoning in Frontier Models
Abstract
We propose a suite of tasks to evaluate the instrumental self-reasoning ability of large language model (LLM) agents. Instrumental self-reasoning ability is important to measure because its emergence could enable self-improvement and adaptation, but it could also pose significant risks, such as enabling deceptive alignment. Prior work has evaluated self-reasoning only in non-agentic settings or in limited domains. In this paper, we propose evaluations for instrumental self-reasoning ability in agentic tasks across a range of scenarios. We evaluate agents built on state-of-the-art LLMs, including both commercial and open-source systems. We find that instrumental self-reasoning ability emerges only in the most capable frontier models and that it is highly context-dependent. Current models are far from passing the most difficult versions of the proposed tasks, and our tasks can be used to measure instrumental self-reasoning ability in future models.
Cite
Text
Fronsdal and Lindner. "MISR: Measuring Instrumental Self-Reasoning in Frontier Models." NeurIPS 2024 Workshops: SoLaR, 2024.
Markdown
[Fronsdal and Lindner. "MISR: Measuring Instrumental Self-Reasoning in Frontier Models." NeurIPS 2024 Workshops: SoLaR, 2024.](https://mlanthology.org/neuripsw/2024/fronsdal2024neuripsw-misr/)
BibTeX
@inproceedings{fronsdal2024neuripsw-misr,
  title = {{MISR: Measuring Instrumental Self-Reasoning in Frontier Models}},
  author = {Fronsdal, Kai and Lindner, David},
  booktitle = {NeurIPS 2024 Workshops: SoLaR},
  year = {2024},
  url = {https://mlanthology.org/neuripsw/2024/fronsdal2024neuripsw-misr/}
}