Knowledge Graph Extraction from Total Synthesis Documents

Abstract

Knowledge graphs (KGs) have emerged as a powerful tool for organizing and integrating complex information, making them a suitable format for scientific knowledge. However, translating scientific knowledge into KGs is challenging because documents present data and ideas using a wide variety of styles and elements. Although efforts toward KG extraction (KGE) from scientific documents exist, evaluation remains challenging and field-dependent, and existing benchmarks do not focus on scientific information. Furthermore, establishing a general benchmark for this task is difficult because not all scientific knowledge has a ground-truth KG representation, which makes any benchmark prone to ambiguity. Here we propose the Graph of Organic Synthesis Benchmark (GOSyBench), a benchmark for KG extraction from scientific documents in chemistry that leverages the native KG-like structure of synthetic routes in organic chemistry. We develop KG-extraction algorithms based on LLMs (GPT-4, Claude, Mistral) and VLMs (GPT-4o), the best of which reaches 73% recovery accuracy and 59% precision, leaving substantial room for improvement. We expect GOSyBench to serve as a valuable resource for evaluating and advancing KGE methods in the scientific domain, ultimately facilitating better organization, integration, and discovery of scientific knowledge.
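The abstract does not define how recovery accuracy and precision are computed. As a rough illustration only, one common reading treats a synthetic route as a set of directed edges (precursor to product) and scores an extracted graph by exact edge overlap with a reference graph. The function name, the edge representation, and the toy node labels below are all hypothetical and are not taken from GOSyBench.

```python
def edge_recall_precision(true_edges, pred_edges):
    """Score an extracted graph against a reference graph by exact edge match.

    Assumption (not from the paper): "recovery" is read here as edge-level
    recall, and "precision" as the fraction of extracted edges that appear
    in the reference graph.
    """
    true_set, pred_set = set(true_edges), set(pred_edges)
    matched = true_set & pred_set
    # Guard against empty graphs to avoid division by zero.
    recall = len(matched) / len(true_set) if true_set else 0.0
    precision = len(matched) / len(pred_set) if pred_set else 0.0
    return recall, precision


# Toy route with hypothetical node labels: A -> B -> C -> D.
reference = [("A", "B"), ("B", "C"), ("C", "D")]
# Hypothetical extraction: two correct edges plus two spurious ones.
extracted = [("A", "B"), ("B", "C"), ("B", "X"), ("X", "D")]

recall, precision = edge_recall_precision(reference, extracted)
print(f"recall={recall:.2f}, precision={precision:.2f}")
```

Exact matching is the simplest choice. A real evaluation would likely also need to decide when two nodes count as the same compound, which this sketch sidesteps by comparing labels directly.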

Cite

Text

Bran et al. "Knowledge Graph Extraction from Total Synthesis Documents." ICML 2024 Workshops: AI4Science, 2024.

Markdown

[Bran et al. "Knowledge Graph Extraction from Total Synthesis Documents." ICML 2024 Workshops: AI4Science, 2024.](https://mlanthology.org/icmlw/2024/bran2024icmlw-knowledge/)

BibTeX

@inproceedings{bran2024icmlw-knowledge,
  title     = {{Knowledge Graph Extraction from Total Synthesis Documents}},
  author    = {Bran, Andres M and Jončev, Zlatko and Schwaller, Philippe},
  booktitle = {ICML 2024 Workshops: AI4Science},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/bran2024icmlw-knowledge/}
}