SAIR: Enabling Deep Learning for Protein-Ligand Interactions with a Synthetic Structural Dataset
Abstract
Accurate prediction of protein-ligand binding affinities remains a cornerstone problem in drug discovery. While binding affinity is inherently dictated by the 3D structure and dynamics of protein-ligand complexes, current deep learning approaches are limited by the lack of high-quality experimental structures with annotated binding affinities. To address this limitation, we introduce the Structurally Augmented IC50 Repository (SAIR), the largest publicly available dataset of protein-ligand 3D structures with associated activity data. The dataset comprises $5,244,285$ structures across $1,048,857$ unique protein-ligand systems, which were curated from the ChEMBL and BindingDB databases and then computationally folded using the Boltz-1x model. We provide a comprehensive characterization of the dataset, including distributional statistics of proteins and ligands, and evaluate the structural fidelity of the folded complexes using PoseBusters. Our analysis reveals that approximately $3 \%$ of structures exhibit physical anomalies, predominantly related to internal energy violations. As an initial demonstration, we benchmark several binding affinity prediction methods, including empirical scoring functions (Vina, Vinardo), a 3D convolutional neural network (OnionNet-2), and a graph neural network (AEV-PLIG). While machine learning-based models consistently outperform traditional scoring functions, neither class exhibits a high correlation with ground truth affinities, highlighting the need for models specifically fine-tuned to synthetic structure distributions. This work provides a foundation for developing and evaluating next-generation structure and binding-affinity prediction models and offers insights into the structural and physical underpinnings of protein-ligand interactions. The link to the data will be added upon publication to preserve the anonymity of the submission.
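As an illustration of what "correlation with ground truth affinities" can mean in practice, the following minimal sketch converts IC50 values to pIC50 and computes a Pearson correlation against model scores. The abstract does not state the units, the metric, or the preprocessing used in the benchmark, so the nanomolar input, the choice of Pearson correlation, the helper names `pic50_from_nm` and `pearson`, and all numeric values here are assumptions made for illustration only.

```python
import math


def pic50_from_nm(ic50_nm):
    """Convert an IC50 given in nanomolar to pIC50 = -log10(IC50 in molar)."""
    # 1 nM = 1e-9 M, so -log10(x * 1e-9) = 9 - log10(x).
    return 9.0 - math.log10(ic50_nm)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Hypothetical measurements and predictions, not taken from SAIR.
ic50_nm = [10.0, 100.0, 1000.0, 10000.0]
predicted = [7.5, 7.1, 6.4, 5.2]

experimental = [pic50_from_nm(v) for v in ic50_nm]
r = pearson(predicted, experimental)
print(f"Pearson r = {r:.3f}")
```

Working in pIC50 rather than raw IC50 keeps the target on a logarithmic scale, which is a common convention for affinity regression; a rank-based metric such as Spearman correlation would be an equally reasonable choice.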
Cite
Text
Lemos et al. "SAIR: Enabling Deep Learning for Protein-Ligand Interactions with a Synthetic Structural Dataset." International Conference on Learning Representations, 2026.
Markdown
[Lemos et al. "SAIR: Enabling Deep Learning for Protein-Ligand Interactions with a Synthetic Structural Dataset." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/lemos2026iclr-sair/)
BibTeX
@inproceedings{lemos2026iclr-sair,
title = {{SAIR: Enabling Deep Learning for Protein-Ligand Interactions with a Synthetic Structural Dataset}},
author = {Lemos, Pablo and Beckwith, Zane and Bandi, Sasaank and Van Damme, Maarten and Crivelli-Decker, Jordan and Shields, Benjamin J. and Merth, Thomas and Jha, Punit K and De Mitri, Nicola and Callahan, Tiffany and Nish, Aj and Abruzzo, Paul and Salomon-Ferrer, Romelia and Ganahl, Martin},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/lemos2026iclr-sair/}
}