DOT: A Distillation-Oriented Trainer

Abstract

Knowledge distillation transfers knowledge from a large model to a small one via task and distillation losses. In this paper, we observe a trade-off between the task and distillation losses, i.e., introducing the distillation loss limits the convergence of the task loss. We believe that the trade-off results from insufficient optimization of the distillation loss. The reasoning is as follows: the teacher has a lower task loss than the student, and a lower distillation loss drives the student to be more similar to the teacher, so a better-converged task loss can be obtained. To break the trade-off, we propose the Distillation-Oriented Trainer (DOT). DOT considers gradients of the task and distillation losses separately, then applies a larger momentum to the distillation loss to accelerate its optimization. We empirically show that DOT breaks the trade-off, i.e., both losses are sufficiently optimized. Extensive experiments validate the superiority of DOT. Notably, DOT achieves a +2.59% accuracy improvement on ImageNet-1k for the ResNet50-MobileNetV1 pair. In conclusion, DOT greatly benefits the student's optimization properties in terms of loss convergence and model generalization. Code will be made publicly available.
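
The abstract describes DOT as a momentum-SGD variant that tracks the gradients of the task loss and the distillation loss in separate momentum buffers, giving the distillation buffer a larger momentum coefficient. The sketch below illustrates one way such an optimizer could be written in PyTorch; the class name DOTSGD, the delta hyperparameter, and the two-step API (step_task / step_kd) are illustrative assumptions based on the abstract, not the authors' released implementation.

import torch

class DOTSGD(torch.optim.Optimizer):
    """Sketch of a distillation-oriented SGD variant (assumed, based on the abstract).

    Keeps two momentum buffers per parameter: one accumulated from task-loss
    gradients (momentum mu - delta) and one from distillation-loss gradients
    (momentum mu + delta), so the distillation term is optimized more aggressively.
    """

    def __init__(self, params, lr=0.1, momentum=0.9, delta=0.075, weight_decay=5e-4):
        defaults = dict(lr=lr, momentum=momentum, delta=delta, weight_decay=weight_decay)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step_task(self):
        # Accumulate gradients of the task (e.g., cross-entropy) loss
        # with the smaller momentum mu - delta.
        self._accumulate(buffer_name="task_buf", sign=-1)

    @torch.no_grad()
    def step_kd(self):
        # Accumulate gradients of the distillation loss with the larger
        # momentum mu + delta, then update parameters with the sum of buffers.
        self._accumulate(buffer_name="kd_buf", sign=+1)
        for group in self.param_groups:
            lr, wd = group["lr"], group["weight_decay"]
            for p in group["params"]:
                state = self.state[p]
                if "task_buf" not in state or "kd_buf" not in state:
                    continue
                d_p = state["task_buf"] + state["kd_buf"]
                if wd != 0:
                    d_p = d_p.add(p, alpha=wd)
                p.add_(d_p, alpha=-lr)

    def _accumulate(self, buffer_name, sign):
        for group in self.param_groups:
            mu = group["momentum"] + sign * group["delta"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                buf = self.state[p].setdefault(buffer_name, torch.zeros_like(p))
                buf.mul_(mu).add_(p.grad)

In use, one would backpropagate the task loss (with retain_graph=True), call step_task(), zero the gradients, backpropagate the distillation loss, and call step_kd(), which combines the two buffers and applies the parameter update.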

Cite

Text

Zhao et al. "DOT: A Distillation-Oriented Trainer." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.00569

Markdown

[Zhao et al. "DOT: A Distillation-Oriented Trainer." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/zhao2023iccv-dot/) doi:10.1109/ICCV51070.2023.00569

BibTeX

@inproceedings{zhao2023iccv-dot,
  title     = {{DOT: A Distillation-Oriented Trainer}},
  author    = {Zhao, Borui and Cui, Quan and Song, Renjie and Liang, Jiajun},
  booktitle = {International Conference on Computer Vision},
  year      = {2023},
  pages     = {6189--6198},
  doi       = {10.1109/ICCV51070.2023.00569},
  url       = {https://mlanthology.org/iccv/2023/zhao2023iccv-dot/}
}