When Repeats Drive the Vocabulary: A Byte-Pair Encoding Analysis of T2T Primate Genomes

Abstract

The emergence of telomere-to-telomere (T2T) genome assemblies has opened new avenues for comparative genomics, yet effective tokenization strategies for genomic sequences remain underexplored. In this pilot study, we apply Byte-Pair Encoding (BPE) to nine T2T primate genomes, including three human assemblies, by training an independent BPE tokenizer on each genome with a fixed vocabulary of 512,000 tokens using our custom tool, dnaBPE. Our analysis reveals that only 11,569 tokens are shared across all nine assemblies, while 991,854 tokens are unique to a single genome, indicating that the shared vocabulary shrinks rapidly as more assemblies are compared. Moreover, phylogenetic trees derived from token overlap failed to recapitulate established primate relationships, a discrepancy we attribute to the disproportionate influence of species-specific high-copy repetitive elements. These findings underscore the dual nature of BPE tokenization: while it effectively compresses repetitive sequences, its sensitivity to high-copy elements limits its utility as a universal tool for comparative genomics. We discuss potential hybrid strategies and repeat-masking approaches to refine genomic tokenization, emphasizing the need for domain-specific adaptations in the development of large-scale genomic language models. The dnaBPE tool used in this study is open-source and available at https://github.com/aglabx/dnaBPE.
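
The shared and unique token counts reported above come down to set comparisons across per-assembly vocabularies. The following is a minimal sketch of that kind of comparison, not the dnaBPE implementation: the function name `overlap_summary` and the toy vocabularies are hypothetical and chosen only for illustration, and it assumes each vocabulary is available as a plain set of token strings.

```python
from collections import Counter


def overlap_summary(vocabs):
    """Return (tokens present in every vocabulary, tokens present in exactly one).

    `vocabs` maps an assembly name to its set of token strings.
    """
    # Count in how many vocabularies each token appears.
    counts = Counter(tok for vocab in vocabs.values() for tok in set(vocab))
    n = len(vocabs)
    shared = {tok for tok, c in counts.items() if c == n}
    unique = {tok for tok, c in counts.items() if c == 1}
    return shared, unique


# Toy vocabularies invented for illustration; these are not data from the study.
toy = {
    "assembly_A": {"A", "C", "G", "T", "AT", "GGAAT"},
    "assembly_B": {"A", "C", "G", "T", "AT", "TTAGGG"},
    "assembly_C": {"A", "C", "G", "T", "CG", "TTAGGG"},
}

shared, unique = overlap_summary(toy)
print(len(shared), "tokens shared by all;", len(unique), "tokens unique to one")
```

In this toy case only the single-nucleotide tokens are shared by all three vocabularies, while longer merged tokens tend to be assembly-specific, which mirrors the qualitative pattern the abstract describes.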

Cite

Text

Popova et al. "When Repeats Drive the Vocabulary: A Byte-Pair Encoding Analysis of T2T Primate Genomes." ICLR 2025 Workshops: MLGenX, 2025.

Markdown

[Popova et al. "When Repeats Drive the Vocabulary: A Byte-Pair Encoding Analysis of T2T Primate Genomes." ICLR 2025 Workshops: MLGenX, 2025.](https://mlanthology.org/iclrw/2025/popova2025iclrw-repeats/)

BibTeX

@inproceedings{popova2025iclrw-repeats,
  title     = {{When Repeats Drive the Vocabulary: A Byte-Pair Encoding Analysis of T2T Primate Genomes}},
  author    = {Popova, Marina and Chelombitko, Iaroslav and Komissarov, Aleksey},
  booktitle = {ICLR 2025 Workshops: MLGenX},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/popova2025iclrw-repeats/}
}