Distillation Scaling Laws

Abstract

We propose a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings mitigate the risks associated with large-scale distillation by enabling compute-optimal allocation for both the teacher and student to maximize student performance. We provide compute-optimal distillation recipes for two key scenarios: when a teacher already exists, and when a teacher needs training. In settings involving many students or an existing teacher, distillation outperforms supervised learning up to a compute level that scales predictably with student size. Conversely, if only one student is to be distilled and a teacher also requires training, supervised learning is generally preferable. Additionally, our large-scale study of distillation increases our understanding of the process and helps inform experimental design.
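To make the compute-allocation question concrete, here is a minimal sketch of how one might search for the best split of a fixed FLOP budget between teacher pretraining and student distillation. The loss functions, constants, and function names (tokens_for_budget, supervised_loss, student_distillation_loss, best_allocation) are illustrative assumptions for this sketch, not the fitted distillation scaling law reported in the paper.

# Illustrative sketch: compute-optimal split of a FLOP budget between
# teacher pretraining and student distillation. All functional forms and
# constants below are placeholder assumptions, not the paper's fitted law.

def tokens_for_budget(params: float, flops: float) -> float:
    # Standard approximation: training costs ~6 FLOPs per parameter per token.
    return flops / (6.0 * params)

def supervised_loss(params: float, tokens: float) -> float:
    # Chinchilla-style form L = E + A / N^alpha + B / D^beta (placeholder constants).
    E, A, alpha, B, beta = 1.69, 406.4, 0.34, 410.7, 0.28
    return E + A / params**alpha + B / tokens**beta

def student_distillation_loss(params: float, tokens: float, teacher_loss: float) -> float:
    # Placeholder: the student improves with its own size and distillation tokens
    # but, in this crude stand-in, never surpasses its teacher.
    E, A, alpha, B, beta = 1.69, 406.4, 0.34, 310.0, 0.30
    own_limit = E + A / params**alpha + B / tokens**beta
    return max(own_limit, teacher_loss)

def best_allocation(total_flops: float, student_params: float, teacher_params: float):
    # Grid-search the fraction of compute given to teacher pretraining;
    # the remainder buys distillation tokens for the student.
    best_frac, best_loss = None, float("inf")
    for i in range(5, 96):
        frac = i / 100
        teacher_tokens = tokens_for_budget(teacher_params, frac * total_flops)
        student_tokens = tokens_for_budget(student_params, (1.0 - frac) * total_flops)
        teacher_loss = supervised_loss(teacher_params, teacher_tokens)
        student_loss = student_distillation_loss(student_params, student_tokens, teacher_loss)
        if student_loss < best_loss:
            best_frac, best_loss = frac, student_loss
    return best_frac, best_loss

if __name__ == "__main__":
    frac, loss = best_allocation(total_flops=1e21, student_params=3e8, teacher_params=3e9)
    print(f"teacher compute fraction ~{frac:.2f}, predicted student loss ~{loss:.3f}")

Substituting the paper's fitted distillation scaling law for the placeholder losses would turn this search into an actual compute-optimal recipe; as written, it only demonstrates the structure of the allocation problem the abstract describes.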

Cite

Text

Busbridge et al. "Distillation Scaling Laws." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Busbridge et al. "Distillation Scaling Laws." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/busbridge2025icml-distillation/)

BibTeX

@inproceedings{busbridge2025icml-distillation,
  title     = {{Distillation Scaling Laws}},
  author    = {Busbridge, Dan and Shidani, Amitis and Weers, Floris and Ramapuram, Jason and Littwin, Etai and Webb, Russell},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {5977--6045},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/busbridge2025icml-distillation/}
}