Knowledge Distillation from Internal Representations
Abstract
Knowledge distillation is typically conducted by training a small model (the student) to mimic a large and cumbersome model (the teacher). The idea is to compress the knowledge from the teacher by using its output probabilities as soft-labels to optimize the student. However, when the teacher is considerably large, there is no guarantee that the internal knowledge of the teacher will be transferred into the student; even if the student closely matches the soft-labels, its internal representations may be considerably different. This internal mismatch can undermine the generalization capabilities originally intended to be transferred from the teacher to the student. In this paper, we propose to distill the internal representations of a large model such as BERT into a simplified version of it. We formulate two ways to distill such representations and various algorithms to conduct the distillation. We experiment with datasets from the GLUE benchmark and consistently show that adding knowledge distillation from internal representations is a more powerful method than only using soft-label distillation.
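To make the abstract's idea concrete, below is a minimal sketch (not the authors' exact objective) of combining soft-label distillation with an internal-representation matching term, assuming a teacher and a smaller student that both expose hidden states, e.g. BERT-style encoders. The layer pairing, the KL divergence on tempered logits, the MSE on hidden states, and the weighting parameter alpha are illustrative assumptions, not the paper's specific formulation.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      temperature=2.0, alpha=0.5):
    """Soft-label KD loss plus a penalty on internal representations.

    student_hidden / teacher_hidden: tensors of matching shape, e.g. the
    hidden states of a student layer and the teacher layer it is paired
    with (the pairing and loss choices here are assumptions for illustration).
    """
    t = temperature

    # Soft-label distillation: KL divergence between tempered distributions.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

    # Internal-representation distillation: match the paired hidden states.
    internal_loss = F.mse_loss(student_hidden, teacher_hidden)

    return alpha * soft_loss + (1.0 - alpha) * internal_loss


# Example usage with random tensors standing in for real model outputs.
if __name__ == "__main__":
    batch, seq_len, hidden, num_labels = 4, 16, 768, 2
    s_logits = torch.randn(batch, num_labels)
    t_logits = torch.randn(batch, num_labels)
    s_hidden = torch.randn(batch, seq_len, hidden)
    t_hidden = torch.randn(batch, seq_len, hidden)
    print(distillation_loss(s_logits, t_logits, s_hidden, t_hidden).item())

In practice the internal term is computed over one or more teacher-student layer pairs; the key point the abstract makes is that training against this additional signal, rather than the soft labels alone, pushes the student's internal representations toward the teacher's.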
Cite
Text
Aguilar et al. "Knowledge Distillation from Internal Representations." AAAI Conference on Artificial Intelligence, 2020. doi:10.1609/AAAI.V34I05.6229
Markdown
[Aguilar et al. "Knowledge Distillation from Internal Representations." AAAI Conference on Artificial Intelligence, 2020.](https://mlanthology.org/aaai/2020/aguilar2020aaai-knowledge/) doi:10.1609/AAAI.V34I05.6229
BibTeX
@inproceedings{aguilar2020aaai-knowledge,
title = {{Knowledge Distillation from Internal Representations}},
author = {Aguilar, Gustavo and Ling, Yuan and Zhang, Yu and Yao, Benjamin Z. and Fan, Xing and Guo, Chenlei},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2020},
pages = {7350-7357},
doi = {10.1609/AAAI.V34I05.6229},
url = {https://mlanthology.org/aaai/2020/aguilar2020aaai-knowledge/}
}