From Language Models over Tokens to Language Models over Characters
Abstract
Modern language models are internally—and mathematically—distributions over token strings rather than character strings, posing numerous challenges for programmers building user applications on top of them. For example, if a prompt is specified as a character string, it must be tokenized before passing it to the token-level language model. Thus, the tokenizer and consequent processing are very sensitive to the specification of the prompt (e.g., whether the prompt ends with a space or not). This paper presents algorithms for converting token-level language models to character-level ones. We present both exact and approximate algorithms. In the empirical portion of the paper, we benchmark the practical runtime and approximation quality. Across four publicly available language models, we find that—even with a small computation budget—our method is able to accurately approximate the character-level distribution at reasonably fast speeds, and that a significant improvement in the language model’s compression rate (bits/byte) is achieved.
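The conversion the abstract describes rests on a simple identity: the character-level probability of a string is the sum, over every token string whose concatenation spells that string, of the token-level probability. Below is a minimal sketch of that marginalization using a hypothetical toy unigram token model; the vocabulary, end-of-string token, and function names are all illustrative assumptions, not the paper's algorithm or models.

```python
# Illustrative sketch only: a toy unigram token-level LM, not the paper's
# method. VOCAB, "<eos>", and all names here are hypothetical.
VOCAB = {"a": 0.3, "ab": 0.2, "b": 0.3, "<eos>": 0.2}

def token_string_prob(tokens):
    """Probability of a token string under the toy unigram token-level LM."""
    p = 1.0
    for t in tokens:
        p *= VOCAB[t]
    return p

def segmentations(s):
    """Yield every way of segmenting character string `s` into vocabulary items."""
    if not s:
        yield []
        return
    for tok in VOCAB:
        if tok != "<eos>" and s.startswith(tok):
            for rest in segmentations(s[len(tok):]):
                yield [tok] + rest

def char_string_prob(chars):
    """Character-level probability: marginalize over all tokenizations of `chars`."""
    return sum(token_string_prob(seg + ["<eos>"]) for seg in segmentations(chars))

# "ab" can be tokenized as ["ab"] or ["a", "b"]; the character-level model
# sums both: 0.2 * 0.2  +  0.3 * 0.3 * 0.2  =  0.058
print(char_string_prob("ab"))
```

Brute-force enumeration of tokenizations is exponential in string length; per the abstract, the paper's contribution is exact and approximate algorithms that make this marginalization practical for real autoregressive language models.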
Cite
Text
Vieira et al. "From Language Models over Tokens to Language Models over Characters." Proceedings of the 42nd International Conference on Machine Learning, 2025.Markdown
[Vieira et al. "From Language Models over Tokens to Language Models over Characters." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/vieira2025icml-language/)BibTeX
@inproceedings{vieira2025icml-language,
title = {{From Language Models over Tokens to Language Models over Characters}},
author = {Vieira, Tim and Lebrun, Benjamin and Giulianelli, Mario and Gastaldi, Juan Luis and DuSell, Brian and Terilla, John and O'Donnell, Timothy J. and Cotterell, Ryan},
booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
year = {2025},
pages = {61391--61412},
volume = {267},
url = {https://mlanthology.org/icml/2025/vieira2025icml-language/}
}