SALSA: Semantically-Aware Latent Space Autoencoder

Abstract

In learning molecular representations, SMILES strings enable the use of powerful NLP methodologies, such as sequence autoencoders. However, an autoencoder trained solely on SMILES is insufficient to learn semantically meaningful molecular representations, i.e., representations that capture structural similarities between molecules. We demonstrate by example that a standard SMILES autoencoder may map structurally similar molecules to distant latent vectors, resulting in an incoherent latent space. To address this shortcoming, we propose the Semantically-Aware Latent Space Autoencoder (SALSA), a transformer autoencoder modified with a contrastive objective that maps structurally similar molecules to nearby vectors in the latent space. We evaluate the semantic awareness of SALSA representations against both a naive autoencoder and the standard ECFP4 fingerprint. We show empirically that SALSA learns a representation that exhibits 1) structural awareness, 2) physicochemical property awareness, 3) biological property awareness, and 4) semantic continuity.
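To illustrate the kind of contrastive objective the abstract describes, the sketch below shows an InfoNCE-style loss that pulls an anchor embedding toward the embedding of a structurally similar molecule while pushing it away from the rest of the batch. This is a minimal, hypothetical example in NumPy, not the paper's exact loss or architecture; the function name and the toy latent vectors are assumptions for illustration.

```python
import numpy as np

def contrastive_loss(anchors, positives, temperature=0.5):
    """InfoNCE-style contrastive loss (illustrative; not SALSA's exact objective).

    Row i of `positives` is the embedding of a molecule structurally similar
    to anchor i; all other rows in the batch act as negatives.
    """
    # L2-normalize so the dot product is cosine similarity
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    # pairwise similarity matrix, scaled by temperature
    logits = a @ p.T / temperature
    # cross-entropy with the matching index as the target class
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy latent vectors: when positives really are near their anchors,
# the loss is lower than when the pairing is randomly shuffled.
rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
aligned = contrastive_loss(z, z + 0.01 * rng.normal(size=(8, 16)))
shuffled = contrastive_loss(z, rng.permutation(z, axis=0))
```

Minimizing such a loss encourages structurally similar molecules to occupy nearby latent vectors, which is the "semantic awareness" property the paper evaluates.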

Cite

Text

Kirchoff et al. "SALSA: Semantically-Aware Latent Space Autoencoder." NeurIPS 2023 Workshops: AI4D3, 2023.

Markdown

[Kirchoff et al. "SALSA: Semantically-Aware Latent Space Autoencoder." NeurIPS 2023 Workshops: AI4D3, 2023.](https://mlanthology.org/neuripsw/2023/kirchoff2023neuripsw-salsa/)

BibTeX

@inproceedings{kirchoff2023neuripsw-salsa,
  title     = {{SALSA: Semantically-Aware Latent Space Autoencoder}},
  author    = {Kirchoff, Kathryn E and Maxfield, Travis and Tropsha, Alexander and Gomez, Shawn M},
  booktitle = {NeurIPS 2023 Workshops: AI4D3},
  year      = {2023},
  url       = {https://mlanthology.org/neuripsw/2023/kirchoff2023neuripsw-salsa/}
}