AntigenLM: Structure-Aware DNA Language Modeling for Influenza

Abstract

Language models have advanced sequence analysis, yet DNA foundation models often lag behind task-specific methods for unclear reasons. We present AntigenLM, a generative DNA language model pretrained on influenza genomes with intact, aligned functional units. This structure-aware pretraining enables AntigenLM to capture evolutionary constraints and generalize across tasks. Fine-tuned on time-series hemagglutinin (HA) and neuraminidase (NA) sequences, AntigenLM accurately forecasts future antigenic variants across regions and subtypes, including those unseen during training, outperforming phylogenetic and evolution-based models. It also achieves near-perfect subtype classification. Ablation studies show that disrupting genomic structure through fragmentation or shuffling severely degrades performance, revealing the importance of preserving functional-unit integrity in DNA language modeling. AntigenLM thus provides both a powerful framework for antigen evolution prediction and a general principle for building biologically grounded DNA foundation models.

Cite

Text

Pei et al. "AntigenLM: Structure-Aware DNA Language Modeling for Influenza." International Conference on Learning Representations, 2026.

Markdown

[Pei et al. "AntigenLM: Structure-Aware DNA Language Modeling for Influenza." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/pei2026iclr-antigenlm/)

BibTeX

@inproceedings{pei2026iclr-antigenlm,
  title     = {{AntigenLM: Structure-Aware DNA Language Modeling for Influenza}},
  author    = {Pei, Yue and Chi, Xuebin and Kang, Yu},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/pei2026iclr-antigenlm/}
}