MISR: Measuring Instrumental Self-Reasoning in Frontier Models
Abstract
We propose a suite of tasks to evaluate the instrumental self-reasoning ability of large language model (LLM) agents. Instrumental self-reasoning ability is important to measure because its emergence could enable self-improvement and adaptation, but it could also pose significant risks, such as enabling deceptive alignment. Prior work has evaluated self-reasoning only in non-agentic settings or in limited domains. In this paper, we propose evaluations for instrumental self-reasoning ability in agentic tasks across a range of scenarios. We evaluate agents built on state-of-the-art LLMs, including both commercial and open-source systems. We find that instrumental self-reasoning ability emerges only in the most capable frontier models and that it is highly context-dependent. Current models are far from passing the most difficult versions of the proposed tasks, and our tasks can be used to measure instrumental self-reasoning ability in future models.
Cite
Text
Fronsdal and Lindner. "MISR: Measuring Instrumental Self-Reasoning in Frontier Models." NeurIPS 2024 Workshops: SoLaR, 2024.
Markdown
[Fronsdal and Lindner. "MISR: Measuring Instrumental Self-Reasoning in Frontier Models." NeurIPS 2024 Workshops: SoLaR, 2024.](https://mlanthology.org/neuripsw/2024/fronsdal2024neuripsw-misr/)
BibTeX
@inproceedings{fronsdal2024neuripsw-misr,
  title = {{MISR: Measuring Instrumental Self-Reasoning in Frontier Models}},
  author = {Fronsdal, Kai and Lindner, David},
  booktitle = {NeurIPS 2024 Workshops: SoLaR},
  year = {2024},
  url = {https://mlanthology.org/neuripsw/2024/fronsdal2024neuripsw-misr/}
}