ChemEval: A Multi-Level and Fine-Grained Chemical Capability Evaluation for Large Language Models

Huang, Yuqing; Zhang, Rongyang; He, Xuesong; Zhi, Xuyang; Wang, Hao; Chen, Nuo; Liu, Zongbo; Li, Xin; Xu, Feiyang; Liu, Deguang; Liang, Huadong; YiLi; Cui, Jian; Xu, Yin; Wang, Shijin; Liu, Qi; Lian, Defu; Liu, Guiquan; Chen, Enhong

ICLR 2026

Abstract

The emergence of Large Language Models (LLMs) in chemistry marks a significant advancement in applying artificial intelligence to chemical sciences. While these models show promising potential, their effective application in chemistry demands sophisticated evaluation protocols that address the field's inherent complexities. To bridge this critical gap, we introduce ChemEval, an innovative hierarchical assessment framework specifically designed to evaluate LLMs' capabilities across chemical domains. Our methodology incorporates a distinctive four-tier progression system, spanning from basic chemical concepts to advanced theoretical principles. Sixty-two textual and multimodal tasks are designed to enable researchers to conduct fine-grained analysis of model capabilities and achieve precise evaluation via carefully crafted assessment protocols. The framework integrates carefully curated open-source datasets with expert-validated materials, ensuring both practical relevance and scientific rigor. In our experiments, we evaluated the performance of most mainstream LLMs using both zero-shot and few-shot approaches, with carefully designed examples and prompts. Results indicate that general-purpose LLMs, while proficient in understanding chemical literature and following instructions, struggle with tasks requiring deep chemical expertise. In contrast, chemical LLMs perform better in technical tasks but show limitations in general language processing. These findings highlight both the current limitations and future opportunities for LLMs in chemistry. Our research provides a systematic framework for advancing the application of artificial intelligence in chemical research, potentially facilitating new discoveries in the field.
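
The abstract mentions zero-shot and few-shot evaluation. As a rough illustration of the difference between the two settings, the sketch below builds both kinds of prompt from the same question. It is not ChemEval's actual prompt format: the `Example` class, the `build_prompt` function, the "Question:/Answer:" layout, and the sample chemistry questions are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Example:
    """A hypothetical worked example: a question with its reference answer."""
    question: str
    answer: str


def build_prompt(instruction: str, question: str,
                 shots: Sequence[Example] = ()) -> str:
    """Return a zero-shot prompt (no shots) or a few-shot prompt (with shots).

    The layout is an assumption made for this sketch, not the benchmark's own.
    """
    parts: List[str] = [instruction.strip()]
    # Few-shot: each demonstration is shown with its answer filled in.
    for shot in shots:
        parts.append(f"Question: {shot.question}\nAnswer: {shot.answer}")
    # The query ends with an open "Answer:" for the model to complete.
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    instruction = "Answer the chemistry question concisely."
    shots = [Example("What is the molecular formula of water?", "H2O")]
    query = "What is the molecular formula of methane?"

    print(build_prompt(instruction, query))         # zero-shot
    print("---")
    print(build_prompt(instruction, query, shots))  # few-shot (one shot)
```

Under these assumptions, the only difference between the two settings is whether solved demonstrations precede the query, which makes it easy to compare a model's zero-shot and few-shot scores on the same task.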

Cite

Text

Huang et al. "ChemEval: A Multi-Level and Fine-Grained Chemical Capability Evaluation for Large Language Models." International Conference on Learning Representations, 2026.

Markdown

[Huang et al. "ChemEval: A Multi-Level and Fine-Grained Chemical Capability Evaluation for Large Language Models." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/huang2026iclr-chemeval/)

BibTeX

@inproceedings{huang2026iclr-chemeval,
  title     = {{ChemEval: A Multi-Level and Fine-Grained Chemical Capability Evaluation for Large Language Models}},
  author    = {Huang, Yuqing and Zhang, Rongyang and He, Xuesong and Zhi, Xuyang and Wang, Hao and Chen, Nuo and Liu, Zongbo and Li, Xin and Xu, Feiyang and Liu, Deguang and Liang, Huadong and {YiLi} and Cui, Jian and Xu, Yin and Wang, Shijin and Liu, Qi and Lian, Defu and Liu, Guiquan and Chen, Enhong},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/huang2026iclr-chemeval/}
}