Testing Knowledge Distillation Theories with Dataset Size
Abstract
Knowledge distillation (KD) refers to training a student model with the outputs of a teacher model, and is a widespread technique in deep learning. However, it is still not clear how and why distillation works. Previous studies focus on two central aspects of distillation: model size and generalisation. In this work we study distillation in a third dimension: dataset size. We present a suite of experiments across a wide range of datasets, tasks and neural architectures, and consistently observe that the gap in test error between distillation and standard label training widens as the dataset size is reduced. We call this newly discovered property the data efficiency of distillation. Equipped with this new perspective, we test the predictive power of existing theories of KD as we vary the dataset size. Our results disprove the hypothesis that distillation can be understood as label smoothing, and provide further evidence in support of the dark knowledge hypothesis. Ultimately, this work reveals that the dataset size may be a fundamental but overlooked variable in the mechanisms underpinning distillation.
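For context on the setup the abstract compares against, the sketch below shows the standard distillation objective (Hinton et al., 2015): a mixture of cross-entropy on the hard labels and a KL term matching temperature-softened teacher and student predictions. This is a minimal PyTorch sketch of generic KD, not the paper's exact training recipe; the temperature `T` and mixing weight `alpha` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Standard KD objective: hard-label cross-entropy plus softened KL term.

    T and alpha are illustrative hyperparameters, not values from the paper.
    """
    # "Standard label training" part: cross-entropy against ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)
    # "Dark knowledge" part: KL divergence between temperature-softened
    # student and teacher distributions.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradient magnitude is comparable across temperatures
    return alpha * ce + (1.0 - alpha) * kd
```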
Cite
Text
Lanzillotta et al. "Testing Knowledge Distillation Theories with Dataset Size." NeurIPS 2024 Workshops: SciForDL, 2024.

Markdown

[Lanzillotta et al. "Testing Knowledge Distillation Theories with Dataset Size." NeurIPS 2024 Workshops: SciForDL, 2024.](https://mlanthology.org/neuripsw/2024/lanzillotta2024neuripsw-testing/)

BibTeX
@inproceedings{lanzillotta2024neuripsw-testing,
title = {{Testing Knowledge Distillation Theories with Dataset Size}},
author = {Lanzillotta, Giulia and Sarnthein, Felix and Kur, Gil and Hofmann, Thomas and He, Bobby},
booktitle = {NeurIPS 2024 Workshops: SciForDL},
year = {2024},
url = {https://mlanthology.org/neuripsw/2024/lanzillotta2024neuripsw-testing/}
}