Accurate and General DNA Representations Emerge from Genome Foundation Models at Scale
Abstract
Language models applied to protein sequences have become a panacea, enabling therapeutics development, materials engineering, and core biology research. Despite the successes of protein language models, genome language models remain nascent. Recent studies suggest the bottleneck is data volume or modeling context size, since long-range interactions are widely acknowledged but sparsely annotated. However, it may be the case that even short DNA sequences are modeled poorly by existing approaches, and current models are unable to represent the wide array of functions encoded by DNA. To study this, we develop AIDO.DNA, a pretrained module for DNA representation in an AI-driven Digital Organism [1]. AIDO.DNA is a seven billion parameter encoder-only transformer trained on 10.6 billion nucleotides from a dataset of 796 species. By scaling model size while maintaining a short context length of 4k nucleotides, AIDO.DNA shows substantial improvements across a breadth of supervised, generative, and zero-shot tasks relevant to functional genomics, synthetic biology, and drug development. Notably, AIDO.DNA outperforms prior encoder-only architectures _without_ new data, suggesting that new scaling laws are needed to achieve compute-optimal DNA language models. Models and code are available through ModelGenerator in https://github.com/genbio-ai/AIDO and on Hugging Face.
Cite
Text
Ellington et al. "Accurate and General DNA Representations Emerge from Genome Foundation Models at Scale." NeurIPS 2024 Workshops: AIDrugX, 2024.
Markdown
[Ellington et al. "Accurate and General DNA Representations Emerge from Genome Foundation Models at Scale." NeurIPS 2024 Workshops: AIDrugX, 2024.](https://mlanthology.org/neuripsw/2024/ellington2024neuripsw-accurate/)
BibTeX
@inproceedings{ellington2024neuripsw-accurate,
title = {{Accurate and General DNA Representations Emerge from Genome Foundation Models at Scale}},
author = {Ellington, Caleb and Sun, Ning and Ho, Nicholas and Tao, Tianhua and Mahbub, Sazan and Li, Dian and Zhuang, Yonghao and Wang, Hongyi and Xing, Eric P. and Song, Le},
booktitle = {NeurIPS 2024 Workshops: AIDrugX},
year = {2024},
url = {https://mlanthology.org/neuripsw/2024/ellington2024neuripsw-accurate/}
}