MoleculeMiner: Extracting and Linking Molecule Figures with Tabular Metadata
Abstract
Despite an ongoing shift in automated chemical literature search methods, many remain limited in their ability to find specific, relevant information about a drawn molecule and its associated property data. We tackle the challenge of converting drawn molecules to a machine-readable representation and co-referencing any associated molecule data. MoleculeMiner is a system in which a user can supply their own patent or paper and obtain each drawn molecule along with any specific metadata (chemical name, chemical reactivity, yield, purity, etc.) provided anywhere in the PDF in a tabular format, through an interactive, user-friendly environment. We also present MolScribeV2, a molecular image parser that improves upon the original MolScribe by introducing a pixel-based self-attention positional embedding technique. Together with other changes, this makes MolScribeV2 robust to the varied styles of compound drawings commonly found in patents and papers, whether scanned or born-digital. Our extraction and interactive system is available at https://github.com/insitro/MoleculeMiner.
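To make the idea of linking a drawn molecule to its tabular metadata concrete, the following is a minimal illustrative sketch. It is not the MoleculeMiner API: the `MoleculeRecord` class, the `link_metadata` function, the `compound` key, and all sample values are hypothetical, and the matching rule (exact match on a compound label) is an assumption chosen for simplicity rather than the method used in the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MoleculeRecord:
    """Hypothetical record: one drawn molecule plus metadata linked from tables."""
    label: str    # compound label as printed in the document (assumed to exist)
    smiles: str   # machine-readable structure parsed from the drawn figure
    page: int     # page on which the figure appears
    metadata: Dict[str, str] = field(default_factory=dict)


def link_metadata(molecules: List[MoleculeRecord],
                  table_rows: List[Dict[str, str]],
                  key: str = "compound") -> List[MoleculeRecord]:
    """Attach each table row to the molecule whose label equals row[key]."""
    by_label = {m.label: m for m in molecules}
    for row in table_rows:
        mol = by_label.get(row.get(key, ""))
        if mol is not None:
            # Copy every column except the linking key into the molecule's metadata.
            mol.metadata.update({k: v for k, v in row.items() if k != key})
    return molecules


# Made-up sample data, for illustration only.
molecules = [
    MoleculeRecord(label="1", smiles="CCO", page=3),
    MoleculeRecord(label="2", smiles="c1ccccc1", page=3),
]
table_rows = [
    {"compound": "1", "name": "ethanol", "yield": "85%"},
    {"compound": "2", "name": "benzene", "purity": "99%"},
    {"compound": "9", "name": "unmatched row"},  # no drawn molecule with this label
]
linked = link_metadata(molecules, table_rows)
```

Rows whose label matches no drawn molecule are simply skipped in this sketch; a real system would likely need fuzzier matching and a way to report unlinked rows.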
Cite
Text
Dey and Stanley. "MoleculeMiner: Extracting and Linking Molecule Figures with Tabular Metadata." International Joint Conference on Artificial Intelligence, 2025. doi:10.24963/IJCAI.2025/1257
Markdown
[Dey and Stanley. "MoleculeMiner: Extracting and Linking Molecule Figures with Tabular Metadata." International Joint Conference on Artificial Intelligence, 2025.](https://mlanthology.org/ijcai/2025/dey2025ijcai-moleculeminer/) doi:10.24963/IJCAI.2025/1257
BibTeX
@inproceedings{dey2025ijcai-moleculeminer,
title = {{MoleculeMiner: Extracting and Linking Molecule Figures with Tabular Metadata}},
author = {Dey, Abhisek and Stanley, Nathaniel H.},
booktitle = {International Joint Conference on Artificial Intelligence},
year = {2025},
pages = {11034--11038},
doi = {10.24963/IJCAI.2025/1257},
url = {https://mlanthology.org/ijcai/2025/dey2025ijcai-moleculeminer/}
}