Towards Cross-Tokenizer Distillation: The Universal Logit Distillation Loss for LLMs

Abstract

Deploying large language models (LLMs) with billions of parameters is often impractical in industrial settings due to constraints like cost, latency, and hardware limitations. Knowledge distillation (KD) provides a solution by compressing the knowledge from large, resource-intensive models into task-specific smaller ones. Various strategies exist, some relying on the text generated by the teacher model and optionally leveraging its output logits to improve learning. However, these logit-based methods usually require the teacher and student models to share the same tokenizer, which limits their applicability across different model families. In this paper, we propose the Universal Logit Distillation (ULD) loss, which uses optimal transport theory to enable distillation across different architectures and tokenizers. Our results demonstrate that ULD loss effectively facilitates the distillation process, paving the way for more widespread use of distillation.
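The abstract names optimal transport as the tool that makes the loss tokenizer-agnostic but gives no implementation details. The sketch below is a hypothetical illustration, not the paper's exact formulation: each model's next-token distribution is sorted by probability mass, zero-padded to a common width, and the two sorted vectors are compared with an L1 distance, an optimal-transport-inspired comparison that needs no token-to-token mapping between the two vocabularies. The function name and tensor shapes are assumptions made for this example.

import torch
import torch.nn.functional as F

def cross_tokenizer_logit_loss(student_logits: torch.Tensor,
                               teacher_logits: torch.Tensor) -> torch.Tensor:
    """Hypothetical tokenizer-agnostic distillation loss.

    student_logits: (batch, seq_len, student_vocab_size)
    teacher_logits: (batch, seq_len, teacher_vocab_size)
    The two vocabulary sizes may differ; only the sorted probability
    mass is compared, so no alignment between vocabularies is required.
    """
    s_probs = F.softmax(student_logits, dim=-1)
    t_probs = F.softmax(teacher_logits, dim=-1)

    # Sort each distribution by decreasing probability mass.
    s_sorted, _ = torch.sort(s_probs, dim=-1, descending=True)
    t_sorted, _ = torch.sort(t_probs, dim=-1, descending=True)

    # Zero-pad the smaller vocabulary so both tensors have equal width.
    pad = s_sorted.size(-1) - t_sorted.size(-1)
    if pad > 0:
        t_sorted = F.pad(t_sorted, (0, pad))
    elif pad < 0:
        s_sorted = F.pad(s_sorted, (0, -pad))

    # L1 distance between the sorted distributions, averaged over
    # batch and sequence positions.
    return (s_sorted - t_sorted).abs().sum(dim=-1).mean()

In practice such a term would typically be added to the standard cross-entropy loss on teacher-generated text, so the student learns both from the target tokens and from the shape of the teacher's output distribution.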

Cite

Text

Boizard et al. "Towards Cross-Tokenizer Distillation: The Universal Logit Distillation Loss for LLMs." Transactions on Machine Learning Research, 2025.

Markdown

[Boizard et al. "Towards Cross-Tokenizer Distillation: The Universal Logit Distillation Loss for LLMs." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/boizard2025tmlr-crosstokenizer/)

BibTeX

@article{boizard2025tmlr-crosstokenizer,
  title     = {{Towards Cross-Tokenizer Distillation: The Universal Logit Distillation Loss for LLMs}},
  author    = {Boizard, Nicolas and El Haddad, Kevin and Hudelot, Celine and Colombo, Pierre},
  journal   = {Transactions on Machine Learning Research},
  year      = {2025},
  url       = {https://mlanthology.org/tmlr/2025/boizard2025tmlr-crosstokenizer/}
}