An Efficient Tokenization for Molecular Language Models
Abstract
Recently, molecular language models have shown great potential in various chemical applications, e.g., drug discovery. These models adapt auto-regressive language models to molecular data by treating molecules as sequences of atoms, where each atom is mapped to an individual token. However, such atom-level tokenizations limit the models' ability to capture the global structural context of molecules. To tackle this issue, we propose a novel molecular language model, coined Context-Aware Molecular T5 (CAMT5). Inspired by the importance of substructure-level contexts, e.g., ring systems, in understanding molecules, we introduce a substructure-level tokenization for molecular language models. Specifically, we construct a tree structure for each molecule whose nodes correspond to important substructures, i.e., motifs. Then, we train CAMT5 by considering a molecule as a sequence of motif tokens, whose order is determined by a tree-search algorithm. Under the proposed motif token space, one can incorporate chemical context with significantly shorter token sequences than atom-level tokenizations, which helps mitigate issues in auto-regressive molecular generation, e.g., error propagation. In addition, CAMT5 is guaranteed to generate valid molecules without degeneracy, i.e., there is no ambiguity in the meaning of each token, an aspect overlooked by previous models. Extensive experiments demonstrate the effectiveness of CAMT5 on the text-to-molecule generation task. Finally, we propose a simple ensemble strategy that aggregates the outputs of molecular language models with different tokenizations, e.g., SMILES, SELFIES, and ours, further boosting the quality of the generated molecules.
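To give a rough feel for the substructure-level idea, the sketch below contrasts an atom-level view of a molecule with a motif-level view. It is not the paper's algorithm: RDKit's BRICS decomposition is used only as a stand-in for the paper's motif extraction, and the tree construction and tree-search ordering described in the abstract are omitted.

```python
# Illustrative sketch only: BRICS fragments stand in for the paper's motifs;
# the actual CAMT5 motif vocabulary and tree-search ordering are not reproduced here.
from rdkit import Chem
from rdkit.Chem import BRICS


def atom_level_tokens(smiles: str) -> list[str]:
    """Atom-level view: one token per atom, as in atom-wise SMILES tokenizers."""
    mol = Chem.MolFromSmiles(smiles)
    return [atom.GetSymbol() for atom in mol.GetAtoms()]


def motif_level_tokens(smiles: str) -> list[str]:
    """Substructure-level view: one token per BRICS fragment (a proxy for motifs)."""
    mol = Chem.MolFromSmiles(smiles)
    return sorted(BRICS.BRICSDecompose(mol))


if __name__ == "__main__":
    smi = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin
    print(len(atom_level_tokens(smi)), "atom-level tokens")
    print(len(motif_level_tokens(smi)), "motif-level tokens:", motif_level_tokens(smi))
```

As the comparison suggests, a motif-level vocabulary yields far fewer tokens per molecule than an atom-level one, which is the property the abstract leverages to shorten sequences and reduce error propagation during generation.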
Cite
Text
Kim et al. "An Efficient Tokenization for Molecular Language Models." NeurIPS 2024 Workshops: AIDrugX, 2024.
Markdown
[Kim et al. "An Efficient Tokenization for Molecular Language Models." NeurIPS 2024 Workshops: AIDrugX, 2024.](https://mlanthology.org/neuripsw/2024/kim2024neuripsw-efficient-a/)
BibTeX
@inproceedings{kim2024neuripsw-efficient-a,
  title     = {{An Efficient Tokenization for Molecular Language Models}},
  author    = {Kim, Seojin and Nam, Jaehyun and Shin, Jinwoo},
  booktitle = {NeurIPS 2024 Workshops: AIDrugX},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/kim2024neuripsw-efficient-a/}
}