NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents

Zheng, Tianshi; Tam, Kelvin Kiu Wai; Nam, Newt Nguyen Kim Hue; Xu, Baixuan; Wang, Zhaowei; Jiayang, Cheng; Tsang, Hong Ting; Wang, Weiqi; Bai, Jiaxin; Fang, Tianqing; Song, Yangqiu; Wong, Ginny; See, Simon

ICLR 2026

Abstract

Large language models (LLMs) are emerging as powerful tools for scientific law discovery, a foundational challenge in AI-driven science. However, existing benchmarks for this task suffer from a fundamental methodological trilemma, forcing a trade-off between scientific relevance, scalability, and resistance to memorization. Furthermore, they oversimplify discovery as static function fitting, failing to capture the authentic scientific process of uncovering embedded laws through the interactive exploration of complex model systems. To address these critical gaps, we introduce **NewtonBench**, a benchmark comprising 324 scientific law discovery tasks across 12 physics domains. Our design mitigates the evaluation trilemma by using counterfactual law shifts (systematic alterations of canonical laws) to generate a vast suite of problems that are scalable, scientifically relevant, and memorization-resistant. Moreover, we elevate the evaluation from static function fitting to interactive model discovery, requiring agents to experimentally probe simulated complex systems to uncover hidden principles. Our extensive evaluation of 11 state-of-the-art LLMs reveals a clear but fragile capability for discovery in frontier models: this ability degrades precipitously with increasing system complexity and exhibits extreme sensitivity to observational noise. Notably, we uncover a paradoxical effect of tool assistance: providing a code interpreter can hinder more capable models by inducing a premature shift from exploration to exploitation, causing them to satisfice on suboptimal solutions. These results demonstrate that robust, generalizable discovery in complex, interactive environments remains the core challenge for the future of automated science. By providing a scalable, robust, and scientifically authentic testbed, NewtonBench offers a crucial tool for measuring true progress and guiding the development of next-generation AI agents capable of genuine scientific discovery.
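
To make the idea of a counterfactual law shift concrete, the following is a minimal illustrative sketch and not the benchmark's actual implementation. It assumes a hypothetical shifted gravitation law in which the distance exponent is 1.5 instead of the canonical 2, and shows how an agent could probe the black-box system and recover the exponent with a log-log least-squares fit. The names `shifted_gravity` and `estimate_exponent`, the exponent value, and the probe radii are all assumptions made for illustration.

```python
import math


def shifted_gravity(m1, m2, r, exponent=1.5, g_const=1.0):
    """Hypothetical counterfactual law: F = G * m1 * m2 / r**1.5.

    The canonical law uses exponent 2; the shift to 1.5 is an
    assumption chosen only to illustrate the idea.
    """
    return g_const * m1 * m2 / r ** exponent


def estimate_exponent(experiment, radii):
    """Probe a black-box law at several radii and fit log F = a - p * log r.

    Returns the estimated distance exponent p (ordinary least squares).
    """
    xs = [math.log(r) for r in radii]
    ys = [math.log(experiment(1.0, 1.0, r)) for r in radii]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return -covariance / variance


# The "agent" chooses its own experiments (here, four radii) and infers the law.
recovered = estimate_exponent(shifted_gravity, [1.0, 2.0, 4.0, 8.0])
print(f"estimated distance exponent: {recovered:.3f}")
```

Because the exponent differs from the textbook value, recalling the canonical law gives the wrong answer; the agent has to run experiments. This sketch is noise-free and has a single hidden parameter, so it does not reflect the system complexity or observational noise that the abstract reports as degrading model performance.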

Cite

Text

Zheng et al. "NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents." International Conference on Learning Representations, 2026.

Markdown

[Zheng et al. "NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/zheng2026iclr-newtonbench/)

BibTeX

@inproceedings{zheng2026iclr-newtonbench,
  title     = {{NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents}},
  author    = {Zheng, Tianshi and Tam, Kelvin Kiu Wai and Nam, Newt Nguyen Kim Hue and Xu, Baixuan and Wang, Zhaowei and Jiayang, Cheng and Tsang, Hong Ting and Wang, Weiqi and Bai, Jiaxin and Fang, Tianqing and Song, Yangqiu and Wong, Ginny and See, Simon},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/zheng2026iclr-newtonbench/}
}