SMI-Editor: Edit-Based SMILES Language Model with Fragment-Level Supervision
Abstract
SMILES, a crucial textual representation of molecular structures, has garnered significant attention as a foundation for pre-trained language models (LMs). However, most existing pre-trained SMILES LMs rely solely on single-token-level supervision during pre-training, failing to fully leverage the substructural information of molecules. This limitation makes the pre-training task overly simplistic and prevents the models from capturing richer molecular semantics. Moreover, during pre-training these SMILES LMs process only corrupted SMILES inputs and never encounter a valid SMILES, which leads to a train-inference mismatch. To address these challenges, we propose SMI-Editor, a novel edit-based pre-trained SMILES LM. SMI-Editor randomly disrupts substructures within a molecule and feeds the resulting SMILES back into the model, which then attempts to restore the original SMILES through an editing process. This approach not only introduces fragment-level training signals but also enables the use of valid SMILES as inputs, allowing the model to learn how to reconstruct complete molecules from incomplete structures. As a result, the model demonstrates improved scalability and an enhanced ability to capture fragment-level molecular information. Experimental results show that SMI-Editor achieves state-of-the-art performance across multiple downstream molecular tasks, even outperforming several 3D molecular representation models.
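The corruption-and-restoration scheme described above can be illustrated with a minimal, stdlib-only sketch. This is not the authors' implementation: real fragment removal would operate on chemically meaningful substructures (e.g., via SMILES tokenization or RDKit fragmentation), whereas this toy version deletes a contiguous character span and derives edit operations (insert/delete/replace) as supervision targets, in the spirit of an edit-based LM.

```python
import difflib
import random

def corrupt_smiles(smiles: str, rng: random.Random, span: int = 3) -> str:
    """Delete a random contiguous character span as a toy stand-in
    for chemically-aware fragment removal."""
    if len(smiles) <= span:
        return smiles
    start = rng.randrange(len(smiles) - span)
    return smiles[:start] + smiles[start + span:]

def edit_targets(corrupted: str, original: str):
    """Edit operations that would restore the original SMILES --
    a sketch of the supervision signal for an edit-based model."""
    sm = difflib.SequenceMatcher(a=corrupted, b=original, autojunk=False)
    return [(op, corrupted[i1:i2], original[j1:j2])
            for op, i1, i2, j1, j2 in sm.get_opcodes() if op != "equal"]

rng = random.Random(0)
original = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin
corrupted = corrupt_smiles(original, rng)
print(corrupted)
print(edit_targets(corrupted, original))
```

During pre-training, the model would consume `corrupted` and be trained to emit the edit operations rather than regenerate the full sequence token by token.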
Cite
Text
Zheng et al. "SMI-Editor: Edit-Based SMILES Language Model with Fragment-Level Supervision." International Conference on Learning Representations, 2025.
Markdown
[Zheng et al. "SMI-Editor: Edit-Based SMILES Language Model with Fragment-Level Supervision." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/zheng2025iclr-smieditor/)
BibTeX
@inproceedings{zheng2025iclr-smieditor,
title = {{SMI-Editor: Edit-Based SMILES Language Model with Fragment-Level Supervision}},
author = {Zheng, Kangjie and Liang, Siyue and Yang, Junwei and Feng, Bin and Liu, Zequn and Ju, Wei and Xiao, Zhiping and Zhang, Ming},
booktitle = {International Conference on Learning Representations},
year = {2025},
url = {https://mlanthology.org/iclr/2025/zheng2025iclr-smieditor/}
}