Do Chemical Language Models Provide a Better Compound Representation?

Abstract

In recent years, several chemical language models have been developed, inspired by the success of protein language models and advancements in natural language processing. In this study, we explore whether pre-training a chemical language model on billion-scale compound datasets, such as Enamine and ZINC20, can lead to an improved compound representation in the drug space. We compare the learned representations of these models with the de facto standard compound representation, and evaluate their potential application in drug discovery and development by benchmarking them on biophysics, physiology, and physical chemistry datasets. Our findings suggest that the conventional masked language modeling approach on these extensive pre-training datasets is insufficient to enhance compound representations. This highlights the need for additional physicochemical inductive bias in the modeling, beyond scaling the dataset size.
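For readers unfamiliar with the comparison the abstract describes, the sketch below illustrates the two kinds of compound representation involved: an extended-connectivity fingerprint (ECFP), a common choice for the de facto standard baseline, versus an embedding extracted from a masked chemical language model. This is an illustrative sketch only, not the paper's code; the specific model checkpoint ("seyonec/ChemBERTa-zinc-base-v1") and the ECFP settings (radius 2, 2048 bits) are assumptions chosen for the example.

# Illustrative sketch (assumptions noted above): ECFP baseline vs. a
# mean-pooled embedding from a masked chemical language model.
import numpy as np
import torch
from rdkit import Chem
from rdkit.Chem import AllChem
from transformers import AutoModel, AutoTokenizer

def ecfp_representation(smiles: str, radius: int = 2, n_bits: int = 2048) -> np.ndarray:
    """Extended-connectivity fingerprint, a standard compound representation."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array(fp, dtype=np.float32)

def clm_representation(smiles: str,
                       model_name: str = "seyonec/ChemBERTa-zinc-base-v1") -> np.ndarray:
    """Mean-pooled last-layer embedding from a masked chemical language model
    (checkpoint name is an assumption, not the model used in the paper)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    inputs = tokenizer(smiles, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0).numpy()

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin
print(ecfp_representation(smiles).shape)  # (2048,)
print(clm_representation(smiles).shape)   # e.g. (768,)

Either vector can then be fed to a downstream predictor (e.g., a gradient-boosted tree or a small MLP) on biophysics, physiology, or physical chemistry benchmarks to compare how informative the two representations are.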

Cite

Text

Torrisi et al. "Do Chemical Language Models Provide a Better Compound Representation?." NeurIPS 2023 Workshops: AI4D3, 2023.

Markdown

[Torrisi et al. "Do Chemical Language Models Provide a Better Compound Representation?." NeurIPS 2023 Workshops: AI4D3, 2023.](https://mlanthology.org/neuripsw/2023/torrisi2023neuripsw-chemical/)

BibTeX

@inproceedings{torrisi2023neuripsw-chemical,
  title     = {{Do Chemical Language Models Provide a Better Compound Representation?}},
  author    = {Torrisi, Mirko and Asadollahi, Saeid and De la Vega de Leon, Antonio and Wang, Kai and Copeland, Wilbert},
  booktitle = {NeurIPS 2023 Workshops: AI4D3},
  year      = {2023},
  url       = {https://mlanthology.org/neuripsw/2023/torrisi2023neuripsw-chemical/}
}