A Partition Cover Approach to Tokenization

Abstract

Tokenization is the process of encoding strings into tokens from a fixed-size vocabulary, and is widely utilized in Natural Language Processing applications. The leading tokenization algorithm today is Byte Pair Encoding (BPE), which formulates the tokenization problem as a compression problem and tackles it by performing sequences of merges. In this work, we formulate tokenization as an optimization objective, show that it is NP-hard via a simple reduction from vertex cover, and propose a polynomial-time greedy algorithm, GreedTok. Our formulation naturally relaxes to the well-studied weighted maximum coverage problem, which has a simple $(1 - 1/e)$-approximation algorithm, GreedWMC. Through empirical evaluations on real-world corpora, we show that GreedTok outperforms BPE and Unigram on compression and achieves a covering score comparable to that of GreedWMC. Finally, our extensive pre-training of two 1-billion-parameter transformer-based language models, comparing BPE and GreedTok as the tokenizer, shows that GreedTok achieves lower bits per byte even when we control for either the total dataset proportion or the total training tokens.
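
The $(1 - 1/e)$-approximation the abstract refers to is the classical greedy algorithm for weighted maximum coverage: given a budget of $k$ sets, repeatedly pick the set whose not-yet-covered elements carry the largest total weight. Below is a minimal, self-contained sketch of that generic routine, assuming a simple dict-based representation; the function name, data layout, and toy inputs are illustrative assumptions, not the paper's GreedTok or GreedWMC implementation.

```python
# Classical greedy (1 - 1/e)-approximation for weighted maximum coverage.
# Generic sketch only; the paper's GreedTok/GreedWMC code may differ.

def greedy_weighted_max_coverage(sets, weights, k):
    """sets: dict mapping a set id to a frozenset of element ids;
    weights: dict mapping each element id to its nonnegative weight;
    k: maximum number of sets to select."""
    covered = set()   # elements covered so far
    chosen = []       # ids of selected sets, in pick order
    for _ in range(k):
        best_id, best_gain = None, 0.0
        for sid, elems in sets.items():
            if sid in chosen:
                continue
            # Marginal gain: total weight of elements this set would newly cover.
            gain = sum(weights[e] for e in elems - covered)
            if gain > best_gain:
                best_id, best_gain = sid, gain
        if best_id is None:  # no remaining set adds positive weight
            break
        chosen.append(best_id)
        covered |= sets[best_id]
    return chosen, covered

# Toy usage (illustrative numbers only):
sets = {"ab": frozenset({1, 2}), "bc": frozenset({2, 3}), "cd": frozenset({3, 4})}
weights = {1: 5.0, 2: 1.0, 3: 4.0, 4: 2.0}
print(greedy_weighted_max_coverage(sets, weights, k=2))
```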

Cite

Text

Lim et al. "A Partition Cover Approach to Tokenization." Advances in Neural Information Processing Systems, 2025.

Markdown

[Lim et al. "A Partition Cover Approach to Tokenization." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/lim2025neurips-partition/)

BibTeX

@inproceedings{lim2025neurips-partition,
  title     = {{A Partition Cover Approach to Tokenization}},
  author    = {Lim, Jia Peng and Tan, Shawn and Choo, Davin and Lauw, Hady W.},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/lim2025neurips-partition/}
}