Teach Less, Learn More: On the Undistillable Classes in Knowledge Distillation

Abstract

Knowledge distillation (KD) can effectively compress neural networks by training a smaller network (student) to mimic the behavior of a larger one (teacher). A counter-intuitive observation is that a larger teacher does not necessarily make a better student, but the reasons for this phenomenon remain unclear. In this paper, we demonstrate that this is directly attributed to the presence of *undistillable classes*: when trained with distillation, the teacher's knowledge of some classes is incomprehensible to the student model. We observe that while KD improves overall accuracy, this comes at the cost of reduced accuracy on these undistillable classes. After establishing their widespread existence in state-of-the-art distillation methods, we show that they correlate with the capacity gap between teacher and student models. Finally, we present a simple Teach Less, Learn More (TLLM) framework to identify and discard the undistillable classes during training. We validate the effectiveness of our approach on multiple datasets with varying network architectures. In all settings, our proposed method exceeds the performance of competitive state-of-the-art techniques.
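To make the mechanics the abstract describes concrete, below is a minimal sketch of Hinton-style KD in which the distillation term is masked out for samples whose ground-truth class has been flagged as undistillable. This is an illustrative assumption, not the paper's actual TLLM procedure: the `undistillable` tensor, the masking criterion, and the hyperparameters `T` and `alpha` are hypothetical, and how the paper identifies such classes is its contribution and is not reproduced here.

```python
import torch
import torch.nn.functional as F

def masked_kd_loss(student_logits, teacher_logits, targets,
                   undistillable, T=4.0, alpha=0.9):
    """Sketch of KD with undistillable classes removed from the soft loss.

    `undistillable` is a hypothetical boolean tensor of shape [num_classes]
    marking classes whose teacher knowledge the student should not imitate.
    """
    # Hard-label cross-entropy, applied to every sample as usual.
    ce = F.cross_entropy(student_logits, targets, reduction="none")

    # Soft-label KL divergence between temperature-scaled distributions,
    # scaled by T^2 as in standard KD.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="none",
    ).sum(dim=1) * (T * T)

    # "Teach less": zero out the teacher's signal for samples whose
    # ground-truth class is flagged as undistillable.
    keep = (~undistillable[targets]).float()
    loss = (1 - alpha) * ce + alpha * keep * kd
    return loss.mean()
```

A usage note: with `undistillable` set to all `False`, this reduces to ordinary KD, so the masking can be toggled per class without changing the training loop.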

Cite

Text

Zhu et al. "Teach Less, Learn More: On the Undistillable Classes in Knowledge Distillation." Neural Information Processing Systems, 2022.

Markdown

[Zhu et al. "Teach Less, Learn More: On the Undistillable Classes in Knowledge Distillation." Neural Information Processing Systems, 2022.](https://mlanthology.org/neurips/2022/zhu2022neurips-teach/)

BibTeX

@inproceedings{zhu2022neurips-teach,
  title     = {{Teach Less, Learn More: On the Undistillable Classes in Knowledge Distillation}},
  author    = {Zhu, Yichen and Liu, Ning and Xu, Zhiyuan and Liu, Xin and Meng, Weibin and Wang, Louis and Ou, Zhicai and Tang, Jian},
  booktitle = {Neural Information Processing Systems},
  year      = {2022},
  url       = {https://mlanthology.org/neurips/2022/zhu2022neurips-teach/}
}