Multi-SWE-Bench: A Multilingual Benchmark for Issue Resolving
Abstract
The task of issue resolving aims to modify a codebase to generate a patch that addresses a given issue. However, most existing benchmarks focus almost exclusively on Python, making them insufficient for evaluating Large Language Models (LLMs) across different programming languages. To bridge this gap, we introduce Multi-SWE-bench, a multilingual issue-resolving benchmark covering eight languages: Python, Java, TypeScript, JavaScript, Go, Rust, C, and C++. The benchmark comprises 2,132 high-quality instances, carefully curated by 68 expert annotators to ensure a reliable and accurate evaluation of LLMs on the issue-resolving task. Based on the human annotations, issues are further classified into three difficulty levels. We evaluate a series of state-of-the-art models on Multi-SWE-bench, using both procedural and agent-based frameworks for issue resolving. Our experiments reveal three key findings: (1) Limited generalization across languages: while existing LLMs perform well on Python issues, their ability to generalize to other languages remains limited; (2) Performance aligned with human-annotated difficulty: LLM-based agents' performance closely tracks human-assigned difficulty, with resolution rates decreasing as issue complexity rises; and (3) Performance drop on cross-file issues: the performance of current methods deteriorates significantly when handling cross-file issues. These findings highlight the limitations of current LLMs and underscore the need for more robust models capable of handling a broader range of programming languages and complex issue scenarios.
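For readers unfamiliar with the task setup, the sketch below illustrates how issue-resolving benchmarks of this kind are typically scored: a model-generated patch is applied to the repository at the issue's base commit, and the instance counts as resolved only if the project's tests then pass. This is a minimal illustration, not the official Multi-SWE-bench harness; the field names and the `evaluate` helper are hypothetical.

```python
"""Minimal sketch of issue-resolving evaluation (illustrative only).

Assumes `workdir` is a local git checkout of the instance's repository.
All field names below are hypothetical, not the official schema.
"""

import subprocess
from dataclasses import dataclass


@dataclass
class Instance:
    repo: str          # e.g. "owner/project"
    base_commit: str   # commit the issue was reported against
    issue_text: str    # natural-language problem statement
    test_cmd: str      # command whose exit status decides pass/fail


def evaluate(instance: Instance, patch: str, workdir: str) -> bool:
    """Apply a model-generated unified diff and rerun the project's tests."""
    # Reset the repository to the state the issue was filed against.
    subprocess.run(["git", "checkout", instance.base_commit],
                   cwd=workdir, check=True)
    # Apply the candidate patch from stdin; a patch that does not apply fails.
    applied = subprocess.run(["git", "apply", "-"],
                             input=patch, text=True, cwd=workdir)
    if applied.returncode != 0:
        return False
    # The instance is resolved iff the test command now exits successfully.
    tests = subprocess.run(instance.test_cmd, shell=True, cwd=workdir)
    return tests.returncode == 0
```

In practice, a multilingual benchmark varies `test_cmd` per language ecosystem (e.g., `pytest`, `mvn test`, `cargo test`, `go test ./...`), which is part of what makes cross-language evaluation harder than Python-only setups.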
Cite
Text
Zan et al. "Multi-SWE-Bench: A Multilingual Benchmark for Issue Resolving." Advances in Neural Information Processing Systems, 2025.Markdown
[Zan et al. "Multi-SWE-Bench: A Multilingual Benchmark for Issue Resolving." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/zan2025neurips-multiswebench/)BibTeX
@inproceedings{zan2025neurips-multiswebench,
title = {{Multi-SWE-Bench: A Multilingual Benchmark for Issue Resolving}},
author = {Zan, Daoguang and Huang, Zhirong and Liu, Wei and Chen, Hanwu and Xin, Shulin and Zhang, Linhao and Liu, Qi and Li, Aoyan and Chen, Lu and Zhong, Xiaojian and Liu, Siyao and Xiao, Yongsheng and Chen, Liangqiang and Zhang, Yuyu and Su, Jing and Liu, Tianyu and Long, Rui and Ding, Ming and Xiang, Liang},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025},
url = {https://mlanthology.org/neurips/2025/zan2025neurips-multiswebench/}
}