Genomics Data Lossless Compression with (s, K)-Mer Encoding and Deep Neural Networks

Abstract

Learning-based compression achieves competitive compression ratios for genomics data. Learning-based compressors generally fall into three types: static, adaptive, and semi-adaptive. However, existing compressors suffer from inferior compression ratios or throughput, and adaptive compressors also face model cold-start problems. To address these issues, we propose DeepGeCo, a novel lossless adaptive compression framework for genomics data with (s,k)-mer encoding and deep neural networks, offering three compression modes (MINI for static, PLUS for adaptive, ULTRA for semi-adaptive) to meet flexible requirements on compression ratio or throughput. In DeepGeCo, (1) we adopt BiGRU and Transformer as the backbone to build Warm-Start and Supporter models that address cold-start problems. (2) We introduce (s,k)-mer encoding to pre-process genomics data before feeding it into the DNN model to improve model throughput, and we propose a new metric, Ranking of Throughput and Compression Ratio (RTCR), for effective selection of encoding parameters. (3) We design a threshold controller and a probabilistic mixer within the backbone to balance compression ratio and model throughput. Experiments on 10 real-world datasets show that DeepGeCo's three compression modes achieve up to a 22.949X average throughput improvement and up to a 31.095% average compression ratio improvement while occupying low CPU or GPU memory.
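
The abstract does not define (s,k)-mer encoding, so the following is only an illustrative sketch under a stated assumption: that k is the length of a sliding window over the nucleotide sequence and s is the stride between consecutive windows. The function names `sk_mer_encode` and `sk_mer_decode`, the base-4 token packing, and the handling of leftover bases are hypothetical and are not taken from DeepGeCo. Under this assumption, a larger stride produces fewer tokens for the model to process, which is one plausible reason such pre-processing could raise throughput.

```python
BASES = "ACGT"
BASE_TO_ID = {base: i for i, base in enumerate(BASES)}


def sk_mer_encode(seq, s, k):
    """Slide a length-k window over seq with stride s (assumed meaning of
    (s,k)-mer) and pack each window into an integer in [0, 4**k)."""
    if not 1 <= s <= k:
        raise ValueError("this sketch requires 1 <= s <= k")
    tokens = []
    pos = 0
    while pos + k <= len(seq):
        token = 0
        for base in seq[pos:pos + k]:
            token = token * 4 + BASE_TO_ID[base]
        tokens.append(token)
        pos += s
    # Bases after the last full window are returned unencoded so that
    # the round trip stays lossless.
    covered = pos - s + k if tokens else 0
    return tokens, seq[covered:]


def sk_mer_decode(tokens, tail, s, k):
    """Invert sk_mer_encode: keep the first s bases of every window except
    the last, which is kept whole, then append the unencoded tail."""
    def unpack(token):
        chars = []
        for _ in range(k):
            token, digit = divmod(token, 4)
            chars.append(BASES[digit])
        return "".join(reversed(chars))

    if not tokens:
        return tail
    parts = [unpack(t)[:s] for t in tokens[:-1]]
    parts.append(unpack(tokens[-1]))
    return "".join(parts) + tail


tokens, tail = sk_mer_encode("ACGTAC", s=2, k=3)
print(tokens, tail)                                # [6, 44] C
print(sk_mer_decode(tokens, tail, s=2, k=3))       # ACGTAC
```

With s equal to k the windows do not overlap and each base is encoded exactly once; with s smaller than k consecutive tokens share k - s bases, which the decoder discards.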

Cite

Text

Sun et al. "Genomics Data Lossless Compression with (s, K)-Mer Encoding and Deep Neural Networks." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I12.33371

Markdown

[Sun et al. "Genomics Data Lossless Compression with (s, K)-Mer Encoding and Deep Neural Networks." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/sun2025aaai-genomics/) doi:10.1609/AAAI.V39I12.33371

BibTeX

@inproceedings{sun2025aaai-genomics,
  title     = {{Genomics Data Lossless Compression with (s, K)-Mer Encoding and Deep Neural Networks}},
  author    = {Sun, Hui and Yi, Liping and Ma, Huidong and Sun, Yongxia and Zheng, Yingfeng and Cui, Wenwen and Yan, Meng and Wang, Gang and Liu, Xiaoguang},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {12577-12585},
  doi       = {10.1609/AAAI.V39I12.33371},
  url       = {https://mlanthology.org/aaai/2025/sun2025aaai-genomics/}
}