CodeAssistBench (CAB): Dataset & Benchmarking for Multi-Turn Chat-Based Code Assistance

Myeongsoo Kim, Shweta Garg, Baishakhi Ray, Varun Kumar, Anoop Deoras

NeurIPS 2025

Abstract

Programming assistants powered by large language models have improved dramatically, yet existing benchmarks still evaluate them in narrow code-generation settings. Recent efforts such as InfiBench and StackEval rely on Stack Overflow questions and remain limited to single-turn interactions, manually curated data, and isolated snippets rather than full project environments. We introduce CodeAssistBench (CAB), the first benchmark for evaluating multi-turn, project-grounded programming assistance at scale. CAB automatically constructs datasets from GitHub issues tagged as questions, using an LLM-driven pipeline that filters noise, extracts runnable contexts, builds executable containers, and verifies environment correctness. This enables continuous, automated expansion across diverse repositories without manual intervention. Using CAB, we create a testbed of 3,286 real-world issues across 214 repositories, spanning seven languages. Evaluating state-of-the-art models reveals a substantial gap: while models achieve 70–83% accuracy on Stack Overflow–style questions, they solve only 7.22–16.49% of CAB issues from post-training-cutoff repositories. These results highlight a fundamental challenge: current LLMs struggle to provide assistance in realistic, project-specific contexts despite strong performance on traditional Q&A benchmarks. CAB provides a scalable, reproducible framework for advancing research in multi-turn, codebase-grounded programming agents. The benchmark and pipeline are fully automated and publicly available at https://github.com/amazon-science/CodeAssistBench/.

Cite

Text

Kim et al. "CodeAssistBench (CAB): Dataset & Benchmarking for Multi-Turn Chat-Based Code Assistance." Advances in Neural Information Processing Systems, 2025.

Markdown

[Kim et al. "CodeAssistBench (CAB): Dataset & Benchmarking for Multi-Turn Chat-Based Code Assistance." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/kim2025neurips-codeassistbench/)

BibTeX

@inproceedings{kim2025neurips-codeassistbench,
  title     = {{CodeAssistBench (CAB): Dataset \& Benchmarking for Multi-Turn Chat-Based Code Assistance}},
  author    = {Kim, Myeongsoo and Garg, Shweta and Ray, Baishakhi and Kumar, Varun and Deoras, Anoop},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/kim2025neurips-codeassistbench/}
}