RefactorBench: Evaluating Stateful Reasoning in Language Agents Through Code
Abstract
Recent advances in language model (LM) agents and function calling have enabled autonomous, feedback-driven systems to solve problems across various digital domains. To better understand the unique limitations of LM agents, we introduce RefactorBench, a benchmark consisting of 100 large handcrafted multi-file refactoring tasks in popular open-source repositories. Solving tasks within RefactorBench requires thorough exploration of dependencies across multiple files and strong adherence to relevant instructions. Every task is defined by 3 natural language instructions of varying specificity and is mutually exclusive, allowing for the chaining of longer pseudo-tasks on the same repository. Baselines on RefactorBench reveal that current LM agents struggle with simple compositional tasks, solving only 22% of tasks with base instructions, in contrast to a human developer with short time constraints solving 87%. Through trajectory analysis, we identify various unique failure modes of LM agents, and further explore the failure mode of tracking past actions. By adapting a baseline agent to condition on representations of state, we achieve a 43.9% improvement in solving RefactorBench tasks. We further extend our state-aware approach to encompass entire digital environments and outline potential directions for future research. RefactorBench aims to support the study of LM agents by providing a set of real-world, multi-hop tasks within the realm of code.
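The abstract's key intervention is conditioning the agent on an explicit representation of its own past actions. Below is a minimal sketch of what such a state-aware agent loop could look like; it is an illustration only, not the paper's implementation, and `call_lm` and `apply_action` are hypothetical placeholders.

```python
# Sketch (assumption, not the paper's code): an agent loop that feeds the LM
# a structured summary of prior actions on every step, rather than relying on
# the model to track its own history implicitly.
from dataclasses import dataclass, field


@dataclass
class AgentState:
    instruction: str
    files_edited: list = field(default_factory=list)
    actions_taken: list = field(default_factory=list)

    def summary(self) -> str:
        # Compact textual state representation included in every prompt.
        return (
            f"Instruction: {self.instruction}\n"
            f"Files edited so far: {', '.join(self.files_edited) or 'none'}\n"
            f"Actions taken: {len(self.actions_taken)}"
        )


def call_lm(prompt: str) -> dict:
    """Hypothetical LM call returning an action dict, e.g.
    {'type': 'edit', 'file': 'utils.py', 'done': False}."""
    raise NotImplementedError


def apply_action(action: dict) -> str:
    """Hypothetical environment step (edit a file, run a search, run tests)."""
    raise NotImplementedError


def run_agent(instruction: str, max_steps: int = 50) -> AgentState:
    state = AgentState(instruction=instruction)
    for _ in range(max_steps):
        # Unlike a plain reactive loop, the prompt always carries an explicit
        # summary of what has already been done on the repository.
        action = call_lm(state.summary())
        if action.get("done"):
            break
        observation = apply_action(action)
        state.actions_taken.append((action, observation))
        if action.get("type") == "edit":
            state.files_edited.append(action["file"])
    return state
```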
Cite
Text
Gautam et al. "RefactorBench: Evaluating Stateful Reasoning in Language Agents Through Code." NeurIPS 2024 Workshops: OWA, 2024.
Markdown
[Gautam et al. "RefactorBench: Evaluating Stateful Reasoning in Language Agents Through Code." NeurIPS 2024 Workshops: OWA, 2024.](https://mlanthology.org/neuripsw/2024/gautam2024neuripsw-refactorbench/)
BibTeX
@inproceedings{gautam2024neuripsw-refactorbench,
title = {{RefactorBench: Evaluating Stateful Reasoning in Language Agents Through Code}},
author = {Gautam, Dhruv and Garg, Spandan and Jang, Jinu and Sundaresan, Neel and Moghaddam, Roshanak Zilouchian},
booktitle = {NeurIPS 2024 Workshops: OWA},
year = {2024},
url = {https://mlanthology.org/neuripsw/2024/gautam2024neuripsw-refactorbench/}
}