SMI-TED: A Large-Scale Foundation Model for Materials and Chemistry

Abstract

We present SMI-TED, a large-scale encoder–decoder foundation model for materials and chemistry, trained on 91 million SMILES samples from PubChem using self-supervised learning. Our encoder–decoder architecture supports a wide range of complex tasks, including the prediction of quantum chemical properties and reaction yields. We provide two model variants, 289M and 8×289M parameters, to accommodate different use cases. SMI-TED achieves state-of-the-art performance across multiple benchmark datasets. Latent space analyses reveal signs of compositionality and separability, key properties for higher-level reasoning and few-shot learning. In particular, SMI-TED demonstrates its ability to capture chemically meaningful structure–property relationships without task-specific fine-tuning, as shown by the clustering of nitrogen-containing molecules with high HOMO energies. Compared to an encoder-only baseline, SMI-TED achieves a lower Davies–Bouldin index, highlighting the benefits of its reconstruction-based training objective. To support further research and applications, we publicly release the model weights and source code on HuggingFace and GitHub.
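As a minimal sketch of the separability analysis mentioned above, the snippet below computes a Davies–Bouldin index over a set of molecular embeddings using scikit-learn; lower values indicate tighter, better-separated clusters. The random arrays and the KMeans grouping are placeholders standing in for SMI-TED encoder outputs and the paper's chemically motivated groups (e.g., nitrogen-containing molecules with high HOMO energies), not the authors' actual pipeline.

```python
# Hypothetical sketch: scoring latent-space separability with the Davies-Bouldin index.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(0)
# Placeholder for per-molecule latent vectors produced by the SMI-TED encoder.
embeddings = rng.normal(size=(1000, 768))

# Group the latent space (here via KMeans as a stand-in for chemically defined groups)
# and score how well-separated the groups are.
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(embeddings)
score = davies_bouldin_score(embeddings, labels)  # lower = better-separated clusters
print(f"Davies-Bouldin index: {score:.3f}")
```

The same score, computed on embeddings from an encoder-only baseline versus SMI-TED with identical group labels, gives the kind of comparison the abstract refers to.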

Cite

Text

Brazil et al. "SMI-TED: A Large-Scale Foundation Model for Materials and Chemistry." ICLR 2025 Workshops: AI4MAT, 2025.

Markdown

[Brazil et al. "SMI-TED: A Large-Scale Foundation Model for Materials and Chemistry." ICLR 2025 Workshops: AI4MAT, 2025.](https://mlanthology.org/iclrw/2025/brazil2025iclrw-smited/)

BibTeX

@inproceedings{brazil2025iclrw-smited,
  title     = {{SMI-TED: A Large-Scale Foundation Model for Materials and Chemistry}},
  author    = {Brazil, Emilio Vital and Soares, Eduardo and Shirasuna, Victor Yukio and Cerqueira, Renato and Zubarev, Dmitry and Schmidt, Kristin},
  booktitle = {ICLR 2025 Workshops: AI4MAT},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/brazil2025iclrw-smited/}
}