ChemOrch: Empowering LLMs with Chemical Intelligence via Groundbreaking Synthetic Instructions
Abstract
Empowering large language models (LLMs) with chemical intelligence remains a challenge due to the scarcity of high-quality, domain-specific instruction-response datasets and the misalignment of existing synthetic data generation pipelines with the inherently hierarchical and rule-governed structure of chemical information. To address this, we propose ChemOrch, a framework that synthesizes chemically grounded instruction-response pairs through a two-stage process: task-controlled instruction generation and tool-aware response construction. ChemOrch enables controllable diversity and levels of difficulty for the generated tasks and ensures response precision through tool planning and distillation, together with tool-based self-repair mechanisms. The effectiveness of ChemOrch is evaluated based on: 1) the high quality of generated instruction data, demonstrating superior diversity and strong alignment with chemical constraints; 2) the dynamic generation of evaluation tasks that more effectively reveal LLM weaknesses in chemistry; and 3) the significant improvement of LLM chemistry capabilities when the generated instruction data are used for fine-tuning. Our work thus represents a critical step toward scalable and verifiable chemical intelligence in LLMs. The code is available at https://anonymous.4open.science/r/ChemOrch-854A.
Cite
Text
Huang et al. "ChemOrch: Empowering LLMs with Chemical Intelligence via Groundbreaking Synthetic Instructions." Advances in Neural Information Processing Systems, 2025.
Markdown
[Huang et al. "ChemOrch: Empowering LLMs with Chemical Intelligence via Groundbreaking Synthetic Instructions." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/huang2025neurips-chemorch/)
BibTeX
@inproceedings{huang2025neurips-chemorch,
title = {{ChemOrch: Empowering LLMs with Chemical Intelligence via Groundbreaking Synthetic Instructions}},
author = {Huang, Yue and Jiang, Zhengzhe and Luo, Xiaonan and Guo, Kehan and Zhuang, Haomin and Zhou, Yujun and Yuan, Zhengqing and Sun, Xiaoqi and Schleinitz, Jules and Wang, Yanbo and Zhang, Shuhao and Surve, Mihir and Chawla, Nitesh V and Wiest, Olaf and Zhang, Xiangliang},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025},
url = {https://mlanthology.org/neurips/2025/huang2025neurips-chemorch/}
}