Understanding DNA Discrete Diffusion for Engineering Regulatory DNA Sequences
Abstract
Engineering regulatory DNA sequences with precise activity levels remains a major challenge in medicine and biotechnology due to the vast combinatorial space of possible sequences and the complex regulatory grammars governing gene expression. DNA discrete diffusion (D3) has emerged as a promising approach for learning these distributions and generating biologically relevant sequences, yet several key aspects of its capabilities remain unexplored. Here we systematically investigate D3’s performance in biologically relevant, understudied scenarios. First, we demonstrate that D3 maintains robust performance even with limited training data, highlighting its practical utility in real-world applications where data is scarce. Second, we extend D3’s conditional generation capabilities for categorical data, employing classifier-free guidance to improve the quality and specificity of generated sequences. Third, we analyze sequence trajectories during the diffusion process, providing insights into how discrete diffusion navigates the sequence-function landscape. Together, these findings expand our understanding of D3’s strengths and limitations, while introducing new methodological advances for engineering functional regulatory DNA sequences.
Cite
Text
Sarkar et al. "Understanding DNA Discrete Diffusion for Engineering Regulatory DNA Sequences." ICLR 2025 Workshops: AI4NA, 2025.
Markdown
[Sarkar et al. "Understanding DNA Discrete Diffusion for Engineering Regulatory DNA Sequences." ICLR 2025 Workshops: AI4NA, 2025.](https://mlanthology.org/iclrw/2025/sarkar2025iclrw-understanding/)
BibTeX
@inproceedings{sarkar2025iclrw-understanding,
title = {{Understanding DNA Discrete Diffusion for Engineering Regulatory DNA Sequences}},
author = {Sarkar, Anirban and Kang, Yijie and Somia, Nirali and Koo, Peter K},
booktitle = {ICLR 2025 Workshops: AI4NA},
year = {2025},
url = {https://mlanthology.org/iclrw/2025/sarkar2025iclrw-understanding/}
}