Knowledge Distillation: The Functional Perspective
Abstract
Empirical findings of accuracy correlations between students and teachers in the knowledge distillation framework have served as supporting evidence for knowledge transfer. In this paper, we sought to explain and understand the knowledge transfer derived from knowledge distillation via functional similarity, hypothesising that knowledge distillation produces a student that is functionally similar to its teacher model. While we accept this hypothesis for two of three architectures across a range of functional-analysis metrics against four controls, the results show that knowledge transfer is significant but less pronounced than expected under conditions that maximise opportunities for functional similarity. Furthermore, results from using Uniform and Gaussian noise as teachers suggest that the knowledge-sharing aspects of knowledge distillation inadequately describe the accuracy benefits obtained from the knowledge distillation training setup itself. Moreover, we show, in the first instance, that knowledge distillation is not a compression mechanism but primarily a data-dependent training regulariser with, at best, a small capacity to transfer knowledge.
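The abstract refers to the standard knowledge distillation training setup and to Uniform and Gaussian noise teachers used as controls. For readers unfamiliar with that setup, the sketch below shows the classic Hinton-style distillation loss, a temperature-softened KL term against the teacher's logits blended with ordinary cross-entropy, together with a noise "teacher" constructed from random logits. The temperature, weighting, and noise construction here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Hinton-style KD loss: softened KL against the teacher plus hard-label CE.

    T and alpha are illustrative defaults, not the paper's settings.
    """
    # Soft targets from the teacher, softened by temperature T
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    # KL divergence between softened distributions, rescaled by T^2
    kd_term = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T ** 2)
    # Ordinary cross-entropy on the ground-truth labels
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * kd_term + (1 - alpha) * ce_term

# A noise-teacher control in the spirit of the abstract: random logits in place
# of a trained teacher, e.g. Gaussian noise of the same shape as the student output.
# student_logits = student(batch)
# noise_teacher_logits = torch.randn_like(student_logits)
# loss = distillation_loss(student_logits, noise_teacher_logits, labels)
```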
Cite
Text
Mason-Williams et al. "Knowledge Distillation: The Functional Perspective." NeurIPS 2024 Workshops: SciForDL, 2024.

Markdown
[Mason-Williams et al. "Knowledge Distillation: The Functional Perspective." NeurIPS 2024 Workshops: SciForDL, 2024.](https://mlanthology.org/neuripsw/2024/masonwilliams2024neuripsw-knowledge/)

BibTeX
@inproceedings{masonwilliams2024neuripsw-knowledge,
title = {{Knowledge Distillation: The Functional Perspective}},
author = {Mason-Williams, Israel and Mason-Williams, Gabryel and Sandler, Mark},
booktitle = {NeurIPS 2024 Workshops: SciForDL},
year = {2024},
url = {https://mlanthology.org/neuripsw/2024/masonwilliams2024neuripsw-knowledge/}
}