SuperGPQA: Scaling LLM Evaluation Across 285 Graduate Disciplines
Abstract
Large language models (LLMs) have demonstrated remarkable proficiency in mainstream academic disciplines such as mathematics, physics, and computer science. However, human knowledge encompasses over 200 specialized disciplines, far exceeding the scope of existing benchmarks. The capabilities of LLMs in many of these specialized fields, particularly in light industry, agriculture, and service-oriented disciplines, remain inadequately evaluated. To address this gap, we present SuperGPQA, a comprehensive benchmark that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines. Our benchmark employs a novel Human-LLM collaborative filtering mechanism to eliminate trivial or ambiguous questions through iterative refinement based on both LLM responses and expert feedback. Our experimental results reveal significant room for improvement in the performance of current state-of-the-art LLMs across diverse knowledge domains (e.g., the reasoning-focused model Gemini-2.5-Pro achieved the highest accuracy of 63.56% on SuperGPQA), highlighting the considerable gap between current model capabilities and artificial general intelligence. Additionally, we present comprehensive insights from our management of a large-scale annotation process, involving over 80 expert annotators and an interactive Human-LLM collaborative system, offering valuable methodological guidance for future research initiatives of comparable scope.
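The collaborative filtering idea described above can be sketched in a few lines. The following is a minimal, hypothetical illustration (not the paper's actual pipeline): each candidate question is answered by several LLMs; questions every model gets right are dropped as trivial, and questions with no majority answer are flagged for expert review. The function name, data format, and thresholds are assumptions for illustration only.

```python
from collections import Counter

def filter_questions(questions, model_answer_fns, easy_threshold=1.0):
    """Hypothetical sketch of LLM-assisted question filtering:
    drop questions every sampled model answers correctly (too easy),
    and flag low-agreement questions for expert review."""
    keep, review = [], []
    for q in questions:
        answers = [fn(q["prompt"]) for fn in model_answer_fns]
        correct = sum(a == q["answer"] for a in answers)
        if correct / len(answers) >= easy_threshold:
            continue  # trivial: all sampled models solve it
        top_count = Counter(answers).most_common(1)[0][1]
        if top_count <= len(answers) // 2:
            review.append(q)  # ambiguous: no majority answer, send to experts
        else:
            keep.append(q)
    return keep, review

# Toy usage with stubbed "models" (lookup tables standing in for LLM calls)
qs = [
    {"prompt": "q1", "answer": "A"},
    {"prompt": "q2", "answer": "B"},
    {"prompt": "q3", "answer": "C"},
]
models = [
    lambda p: {"q1": "A", "q2": "B", "q3": "A"}[p],
    lambda p: {"q1": "A", "q2": "D", "q3": "B"}[p],
    lambda p: {"q1": "A", "q2": "B", "q3": "C"}[p],
]
keep, review = filter_questions(qs, models)
# q1 is dropped (all models correct), q2 is kept, q3 goes to expert review
```

In the actual benchmark construction this loop would run iteratively, with expert feedback on the flagged questions feeding back into refinement.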
Cite
Text
Du et al. "SuperGPQA: Scaling LLM Evaluation Across 285 Graduate Disciplines." Advances in Neural Information Processing Systems, 2025.

Markdown
[Du et al. "SuperGPQA: Scaling LLM Evaluation Across 285 Graduate Disciplines." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/du2025neurips-supergpqa/)

BibTeX
@inproceedings{du2025neurips-supergpqa,
title = {{SuperGPQA: Scaling LLM Evaluation Across 285 Graduate Disciplines}},
author = {Du, Xeron and Yao, Yifan and Ma, Kaijing and Wang, Bingli and Zheng, Tianyu and Zhu, King and Liu, Minghao and Liang, Yiming and Jin, Xiaolong and Wei, Zhenlin and Zheng, Chujie and Deng, Kaixin and Guo, Shuyue and Jia, Shian and Jiang, Sichao and Liao, Yiyan and Li, Rui and Li, Qinrui and Li, Sirun and Li, Yizhi and Li, Yunwen and Ma, Dehua and Ni, Yuansheng and Que, Haoran and Wang, Qiyao and Wen, Zhoufutu and Wu, Siwei and Xing, Tianshun and 许明, and Yang, Zhenzhu and Wang, Zekun Moore and Zhou, Junting and Bai, Yuelin and Bu, Xingyuan and Cai, Chenglin and Chen, Liang and Chen, Yifan and Chengtuo, Cheng and Cheng, Tianhao and Ding, Keyi and Huang, Siming and Yun, Huang and Li, Yaoru and Li, Yizhe and Li, Zhaoqun and Liang, Tianhao and Lin, Chengdong and Lin, Hongquan and Ma, Yinghao and Peng, Z.Y. and Peng, Zifan and Qi, Qige and Qiu, Shi and Qu, Xingwei and Quan, Shanghaoran and Tan, Yizhou and Wang, Zili and 王晨清, and Wang, Hao and Wang, Yiya and Wang, Yubo and Xu, Jiajun and Yang, Kexin and Yuan, Ruibin and Yue, Yuanhao and Zhan, Tianyang and Zhang, Chun and Zhang, Jinyang and Zhang, Xiyue and Zhang, Owen Xingjian and Zhang, Yue and Zhao, Yongchi and Zheng, Xiangyu and Zhong, Chenghua and Gao, Yang and Li, Zhoujun and Liu, Dayiheng and Liu, Qian and Liu, Tianyu and Ni, Shiwen and Peng, Junran and Qin, Yujia and Su, Wenbo and Wang, Guoyin and Wang, Shi and Yang, Jian and Yang, Min and Cao, Meng and Yue, Xiang and Zhang, Zhaoxiang and Zhou, Wangchunshu and Liu, Jiaheng and Lin, Qunshu and Huang, Wenhao and Zhang, Ge},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025},
url = {https://mlanthology.org/neurips/2025/du2025neurips-supergpqa/}
}