Bootstrapped Training of Score-Conditioned Generator for Offline Design of Biological Sequences

Abstract

We study the problem of optimizing biological sequences, e.g., proteins, DNA, and RNA, to maximize a black-box score function whose evaluations are available only through an offline dataset. We propose a novel solution, the bootstrapped training of score-conditioned generator (BootGen) algorithm. The algorithm repeats a two-stage process. In the first stage, it trains the biological sequence generator with rank-based weights to improve the accuracy of generation conditioned on high scores. The second stage is bootstrapping, which augments the training dataset with self-generated data labeled by a proxy score function. The key idea is to align score-conditioned generation with the proxy score function, thereby distilling the proxy's knowledge into the generator. After training, we aggregate samples from multiple bootstrapped generators and proxies to produce diverse designs. Extensive experiments show that our method outperforms competitive baselines on biological sequence design tasks.
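The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not give the weight formula, so the form `1 / (k * N + rank)` is an assumption borrowed from common rank-based reweighting schemes. The names `rank_based_weights`, `bootstrap_round`, `toy_generate`, and `toy_proxy`, along with the `k`, `n_samples`, and `top_k` parameters, are hypothetical. Keeping only the top proxy-scored candidates is likewise an assumption.

```python
import numpy as np

def rank_based_weights(scores, k=0.01):
    """Assumed scheme: w_i proportional to 1 / (k * N + rank_i), rank 1 = best."""
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]
    order = np.argsort(-scores)          # indices from highest to lowest score
    ranks = np.empty(n, dtype=np.int64)
    ranks[order] = np.arange(1, n + 1)   # rank 1 goes to the highest score
    weights = 1.0 / (k * n + ranks)
    return weights / weights.sum()       # normalize to a sampling distribution

def bootstrap_round(seqs, scores, generate, proxy, n_samples=8, top_k=2):
    """Augment the dataset with self-generated sequences labeled by the proxy."""
    candidates = generate(n_samples)
    proxy_scores = np.array([proxy(s) for s in candidates], dtype=float)
    keep = np.argsort(-proxy_scores)[:top_k]   # assumed filter: best by proxy
    new_seqs = list(seqs) + [candidates[i] for i in keep]
    new_scores = np.concatenate([np.asarray(scores, dtype=float), proxy_scores[keep]])
    return new_seqs, new_scores

# Toy stand-ins for the trained generator and the proxy (both hypothetical).
rng = np.random.default_rng(0)

def toy_generate(n, length=6):
    return ["".join(rng.choice(list("ACGT"), size=length)) for _ in range(n)]

def toy_proxy(seq):
    return seq.count("G") / len(seq)     # placeholder score: fraction of G

seqs = ["ACGTAC", "GGGTAC", "AAAAAA"]
scores = np.array([0.2, 0.9, 0.1])

weights = rank_based_weights(scores)     # stage 1: weights for generator training
seqs2, scores2 = bootstrap_round(seqs, scores, toy_generate, toy_proxy)  # stage 2
```

In a full loop, the weights would reweight the generator's training loss on the current dataset, and the augmented dataset returned by `bootstrap_round` would feed the next round of training.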

Cite

Text

Kim et al. "Bootstrapped Training of Score-Conditioned Generator for Offline Design of Biological Sequences." ICML 2023 Workshops: SPIGM, 2023.

Markdown

[Kim et al. "Bootstrapped Training of Score-Conditioned Generator for Offline Design of Biological Sequences." ICML 2023 Workshops: SPIGM, 2023.](https://mlanthology.org/icmlw/2023/kim2023icmlw-bootstrapped/)

BibTeX

@inproceedings{kim2023icmlw-bootstrapped,
  title     = {{Bootstrapped Training of Score-Conditioned Generator for Offline Design of Biological Sequences}},
  author    = {Kim, Minsu and Berto, Federico and Ahn, Sungsoo and Park, Jinkyoo},
  booktitle = {ICML 2023 Workshops: SPIGM},
  year      = {2023},
  url       = {https://mlanthology.org/icmlw/2023/kim2023icmlw-bootstrapped/}
}