Searching for Phenotypic Needles in Genomic Haystacks: DNA Language Models for Sex Prediction

Abstract

In this study, we explore fine-tuning of Genomic Language Models (GLMs) to predict phenotypic traits directly from genomic sequence, without prior knowledge of the causative loci or the molecular mechanisms linking genotype to phenotype. As a case study, we focus on predicting sex, a well-defined genomic feature associated with the presence of the Y chromosome in most mammals. We adapt a pre-trained GENA-LM model for trait prediction by introducing a sequence chunk classification component with cross-attention, enabling the model to process larger genomic contexts. Training and evaluation on human and mouse genomes demonstrate that the model does not require a high-quality reference genome assembly and converges even when the fraction of genomic signal associated with the phenotype is below 1%. Prediction accuracy improves with increased sequencing depth, highlighting the scalability of GLMs for genome-wide tasks. Furthermore, a multi-species model effectively learns sex-specific signals for both human and mouse, confirming its cross-species predictive ability. Ablation studies demonstrate that the model relies on the Y chromosome for sex prediction, which aligns with established biological principles. Our findings highlight the applicability of GLMs to trait prediction from long and fragmented genomic data.
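The abstract does not specify how the chunk classification component is built, so the following is only a minimal sketch of the general idea of pooling per-chunk embeddings with attention, not the authors' architecture. It assumes each genomic chunk has already been encoded into a fixed-size vector (by an encoder such as GENA-LM) and uses a single query vector with scaled dot-product attention and no learned projections. All names (`cross_attention_pool`, `predict_probability`), the toy embeddings, and the hand-set weights are hypothetical stand-ins for learned parameters.

```python
import math
import random


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_pool(query, chunk_embeddings):
    """Pool chunk embeddings with one query vector (scaled dot-product attention).

    Returns the attention-weighted average embedding and the attention weights.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, e) / scale for e in chunk_embeddings])
    pooled = [
        sum(w * e[i] for w, e in zip(weights, chunk_embeddings))
        for i in range(len(query))
    ]
    return pooled, weights


def predict_probability(pooled, classifier_weights, bias=0.0):
    """Logistic classifier on top of the pooled embedding."""
    logit = dot(pooled, classifier_weights) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy data (assumption): 200 chunk embeddings of dimension 8, mostly noise.
random.seed(0)
dim, n_chunks = 8, 200
chunks = [[random.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(n_chunks)]

# A single informative chunk (0.5% of all chunks) carries the signal
# in its first coordinate; this mimics a sparse phenotype-linked region.
signal_index = 17
chunks[signal_index] = [5.0] + [0.0] * (dim - 1)

# Hand-set stand-ins for parameters that would normally be learned.
query = [10.0] + [0.0] * (dim - 1)
classifier = [2.0] + [0.0] * (dim - 1)

pooled, weights = cross_attention_pool(query, chunks)
prob = predict_probability(pooled, classifier)
top_chunk = max(range(n_chunks), key=lambda i: weights[i])
print(f"most attended chunk: {top_chunk}, predicted probability: {prob:.4f}")
```

In this toy setting the attention weights concentrate on the single informative chunk, which illustrates why an attention-based pooling step can remain useful when only a small fraction of the input carries phenotype-related signal.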

Cite

Text

Chepurova et al. "Searching for Phenotypic Needles in Genomic Haystacks: DNA Language Models for Sex Prediction." ICLR 2025 Workshops: MLGenX, 2025.

Markdown

[Chepurova et al. "Searching for Phenotypic Needles in Genomic Haystacks: DNA Language Models for Sex Prediction." ICLR 2025 Workshops: MLGenX, 2025.](https://mlanthology.org/iclrw/2025/chepurova2025iclrw-searching/)

BibTeX

@inproceedings{chepurova2025iclrw-searching,
  title     = {{Searching for Phenotypic Needles in Genomic Haystacks: DNA Language Models for Sex Prediction}},
  author    = {Chepurova, Alla and Kuratov, Yuri and Belokopytova, Polina and Burtsev, Mikhail and Fishman, Veniamin},
  booktitle = {ICLR 2025 Workshops: MLGenX},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/chepurova2025iclrw-searching/}
}