When Repeats Drive the Vocabulary: A Byte-Pair Encoding Analysis of T2T Primate Genomes
Abstract
The emergence of telomere-to-telomere (T2T) genome assemblies has opened new avenues for comparative genomics, yet effective tokenization strategies for genomic sequences remain underexplored. In this pilot study, we apply Byte-Pair Encoding (BPE) to nine T2T primate genomes, including three human assemblies, by training an independent BPE tokenizer on each genome with a fixed vocabulary of 512,000 tokens using our custom tool, dnaBPE. Our analysis reveals that only 11,569 tokens are shared across all nine assemblies, while 991,854 tokens are unique to a single genome, indicating that the shared vocabulary shrinks rapidly as more assemblies are compared. Moreover, phylogenetic trees derived from token overlap failed to recapitulate established primate relationships, a discrepancy we attribute to the disproportionate influence of species-specific high-copy repetitive elements. These findings underscore the dual nature of BPE tokenization: while it effectively compresses repetitive sequences, its sensitivity to high-copy elements limits its utility as a universal tool for comparative genomics. We discuss potential hybrid strategies and repeat-masking approaches to refine genomic tokenization, emphasizing the need for domain-specific adaptations in the development of large-scale genomic language models. The dnaBPE tool used in this study is open-source and available at https://github.com/aglabx/dnaBPE.
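The comparison described in the abstract can be illustrated with a minimal sketch in Python. It is not the dnaBPE implementation: the function names (`train_bpe`, `shared_and_unique`, `jaccard_distance`), the toy sequences, and the use of Jaccard distance as the overlap measure are all assumptions made for illustration, and the vocabulary is tiny compared with the 512,000 tokens used in the study.

```python
from collections import Counter
from itertools import combinations


def train_bpe(sequence, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent token pair.

    Hypothetical helper for illustration; not the dnaBPE algorithm.
    """
    tokens = list(sequence)  # start from single nucleotides
    vocab = set(tokens)
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so further merges are pointless
        merged = left + right
        vocab.add(merged)
        # Rewrite the token stream, replacing each occurrence of the pair.
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                out.append(merged)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return vocab


def shared_and_unique(vocabs):
    """Return tokens present in every vocabulary and tokens found in only one."""
    shared = set.intersection(*vocabs.values())
    counts = Counter(tok for vocab in vocabs.values() for tok in vocab)
    unique = {
        name: {tok for tok in vocab if counts[tok] == 1}
        for name, vocab in vocabs.items()
    }
    return shared, unique


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; an assumed overlap measure, not the paper's."""
    return 1.0 - len(a & b) / len(a | b)


if __name__ == "__main__":
    # Made-up sequences: each "genome" carries a different high-copy repeat.
    toy_genomes = {
        "genome_a": "ACGTACGT" + "TTAGGG" * 20,
        "genome_b": "ACGTACGT" + "CATTCC" * 20,
        "genome_c": "ACGTACGT" + "GGAAT" * 20,
    }
    vocabs = {name: train_bpe(seq, 10) for name, seq in toy_genomes.items()}
    shared, unique = shared_and_unique(vocabs)
    print("shared tokens:", sorted(shared))
    for name, toks in unique.items():
        print(name, "unique tokens:", sorted(toks))
    for x, y in combinations(vocabs, 2):
        print(x, y, round(jaccard_distance(vocabs[x], vocabs[y]), 3))
```

In this toy setting, the merges are driven by whichever repeat dominates each sequence, so the learned vocabularies diverge even though the sequences share a common prefix. This mirrors, at a very small scale, the sensitivity to species-specific high-copy elements that the abstract reports.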
Cite
Text
Popova et al. "When Repeats Drive the Vocabulary: A Byte-Pair Encoding Analysis of T2T Primate Genomes." ICLR 2025 Workshops: MLGenX, 2025.
Markdown
[Popova et al. "When Repeats Drive the Vocabulary: A Byte-Pair Encoding Analysis of T2T Primate Genomes." ICLR 2025 Workshops: MLGenX, 2025.](https://mlanthology.org/iclrw/2025/popova2025iclrw-repeats/)
BibTeX
@inproceedings{popova2025iclrw-repeats,
title = {{When Repeats Drive the Vocabulary: A Byte-Pair Encoding Analysis of T2T Primate Genomes}},
author = {Popova, Marina and Chelombitko, Iaroslav and Komissarov, Aleksey},
booktitle = {ICLR 2025 Workshops: MLGenX},
year = {2025},
url = {https://mlanthology.org/iclrw/2025/popova2025iclrw-repeats/}
}