PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models
Abstract
Current benchmarks for evaluating the reasoning capabilities of Large Language Models (LLMs) face significant limitations: task oversimplification, data contamination, and flawed evaluation items. These deficiencies necessitate more rigorous assessment methods. To address these limitations, we introduce PHYBench, a benchmark of 500 original physics problems ranging from high school to Physics Olympiad difficulty. PHYBench addresses data contamination through original content and employs a systematic curation pipeline to eliminate flawed items. Evaluations show that PHYBench activates more tokens and provides stronger differentiation between reasoning models compared to other baselines such as AIME 2024, OlympiadBench, and GPQA. Even the best-performing model, Gemini 2.5 Pro, achieves only 36.9% accuracy compared to human experts' 61.9%. To further enhance evaluation precision, we introduce the Expression Edit Distance (EED) Score for mathematical expression assessment, which improves sample efficiency by 204% over binary scoring. Moreover, PHYBench effectively elicits multi-step and multi-condition reasoning, providing a platform for examining models' reasoning robustness, preferences, and deficiencies. The benchmark results and dataset are publicly available at https://www.phybench.cn/.
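The abstract describes the EED Score only at a high level. As a rough illustration of how an edit-distance-based expression score can grant partial credit where binary scoring cannot, the sketch below parses a model answer and the ground truth with sympy, computes a tree edit distance between their expression trees (via the zss library's Zhang-Shasha implementation), and maps the relative distance to a 0-100 score. The choice of distance routine, the node labeling, and the linear scaling are assumptions made for illustration, not the paper's official scoring code.

```python
# Minimal sketch of an EED-style (Expression Edit Distance) score.
# Assumptions (not from the paper): sympy parsing, zss tree edit distance,
# and a linear mapping from relative distance to a 0-100 score.
import sympy as sp
from zss import Node, simple_distance


def expr_to_tree(expr: sp.Expr) -> Node:
    """Convert a sympy expression into a zss tree, labeling nodes by operator or atom."""
    label = expr.func.__name__ if expr.args else str(expr)
    node = Node(label)
    for arg in expr.args:
        node.addkid(expr_to_tree(arg))
    return node


def tree_size(expr: sp.Expr) -> int:
    """Count the nodes in an expression tree."""
    return 1 + sum(tree_size(arg) for arg in expr.args)


def eed_score(answer: str, ground_truth: str) -> float:
    """Return a score in [0, 100]: 100 for a symbolically exact match,
    decreasing with the relative tree edit distance to the ground truth."""
    a = sp.sympify(answer)
    g = sp.sympify(ground_truth)
    if sp.simplify(a - g) == 0:  # symbolically identical answers get full credit
        return 100.0
    dist = simple_distance(expr_to_tree(a), expr_to_tree(g))
    rel = dist / max(tree_size(g), 1)
    return max(0.0, 100.0 * (1.0 - rel))


if __name__ == "__main__":
    print(eed_score("m*g*h + m*v**2/2", "m*g*h + m*v**2/2"))  # 100.0 (exact match)
    print(eed_score("m*g*h", "m*g*h + m*v**2/2"))             # partial credit
```

Because a nearly correct expression differs from the ground truth by only a few tree edits, a continuous score of this kind separates "close" from "completely wrong" answers, which is how an edit-distance metric can improve sample efficiency over binary right/wrong grading.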
Cite
Text
Qiu et al. "PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models." Advances in Neural Information Processing Systems, 2025.

Markdown
[Qiu et al. "PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/qiu2025neurips-phybench/)

BibTeX
@inproceedings{qiu2025neurips-phybench,
  title = {{PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models}},
  author = {Qiu, Shi and Guo, Shaoyang and Song, Zhuo-Yang and Sun, Yunbo and Cai, Zeyu and Wei, Jiashen and Luo, Tianyu and Yin, Yixuan and Zhang, Haoxu and Hu, Yi and Wang, Chenyang and Tang, Chencheng and Chang, Haoling and Liu, Qi and Zhou, Ziheng and Zhang, Tianyu and Zhang, Jingtian and Liu, Zhangyi and Li, Minghao and Zhang, Yuku and Jing, Boxuan and Yin, Xianqi and Ren, Yutong and Fu, Zizhuo and Ji, Jiaming and Wang, Weike and Tian, Xudong and Lv, Anqi and Man, Laifu and Li, Jianxiang and Tao, Feiyu and Sun, Qihua and Liang, Zhou and Mu, Yushu and Li, Zhongxuan and Zhang, Jing-Jun and Zhang, Shutao and Li, Xiaotian and Xia, Xingqi and Lin, Jiawei and Shen, Zheyu and Chen, Jiahang and Xiong, Qiuhao and Wang, Binran and Wang, Fengyuan and Niziyang, and Zhang, Bohan and Cui, Fan and Shaochangkun, and Cao, Qing-Hong and Luo, Ming-xing and Zhang, Muhan and Zhu, Hua Xing},
  booktitle = {Advances in Neural Information Processing Systems},
  year = {2025},
  url = {https://mlanthology.org/neurips/2025/qiu2025neurips-phybench/}
}