ViTKD: Feature-Based Knowledge Distillation for Vision Transformers
Abstract
Knowledge Distillation (KD) has been extensively studied as a means of enhancing the performance of smaller models in Convolutional Neural Networks (CNNs). Recently, the Vision Transformer (ViT) has demonstrated remarkable success in various computer vision tasks, leading to increased demand for KD in ViT. However, while logit-based KD has been applied to ViT, feature-based KD methods designed for CNNs cannot be directly transferred due to the significant structural gap between the two architectures. In this paper, we analyze the properties of different feature layers in ViT to identify a suitable approach for feature-based ViT distillation. Our findings reveal that both shallow and deep layers in ViT are equally important for distillation but require distinct distillation strategies. Based on these guidelines, we propose our feature-based method ViTKD, which mimics the teacher's shallow layers and generates its deep layer. ViTKD leads to consistent and significant improvements in student models. On ImageNet-1K, we achieve performance boosts of 1.64% for DeiT-Tiny, 1.40% for DeiT-Small, and 1.70% for DeiT-Base. Downstream tasks also demonstrate the superiority of ViTKD. Additionally, ViTKD and logit-based KD are complementary and can be applied together directly, further enhancing the student's performance. Specifically, DeiT-T, S, and B achieve accuracies of 77.78%, 83.59%, and 85.41%, respectively, using this combined approach. Code is available at https://github.com/yzd-v/cls_KD.
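As a rough illustration of the two strategies the abstract describes, the sketch below implements a ViTKD-style loss in PyTorch: shallow features are mimicked directly with an MSE loss after linear alignment, while for the deep layer part of the student's tokens are masked and a small generation block must reconstruct the teacher's deep feature. All names and design choices here (`ViTKDSketch`, the transformer-layer generator, `mask_ratio=0.5`) are illustrative assumptions, not the authors' released implementation; see the linked repository for the official code.

```python
import torch
import torch.nn as nn

class ViTKDSketch(nn.Module):
    """Minimal sketch of a ViTKD-style feature distillation loss.

    Shallow layers: the student directly mimics the teacher's features
    (MSE after a linear projection to match channel dimensions).
    Deep layer: a fraction of the student's tokens are replaced by a
    learnable mask token, and a small generation block must reconstruct
    the teacher's deep feature at the masked positions.
    Module choices and hyperparameters are assumptions for illustration.
    """

    def __init__(self, dim_s: int, dim_t: int, mask_ratio: float = 0.5):
        super().__init__()
        self.align_shallow = nn.Linear(dim_s, dim_t)  # student -> teacher dim
        self.align_deep = nn.Linear(dim_s, dim_t)
        # A single transformer encoder layer stands in as the generation
        # block; this is an assumption, not the paper's exact module.
        self.generator = nn.TransformerEncoderLayer(
            d_model=dim_t, nhead=8, batch_first=True
        )
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim_t))
        self.mask_ratio = mask_ratio
        self.mse = nn.MSELoss()

    def forward(self, f_s_shallow, f_t_shallow, f_s_deep, f_t_deep):
        # All features: (B, N, C) patch-token tensors (CLS token removed).
        # 1) Mimicking loss on a shallow layer.
        loss_mimic = self.mse(self.align_shallow(f_s_shallow), f_t_shallow)

        # 2) Generation loss on the deep layer: mask, regenerate, compare.
        x = self.align_deep(f_s_deep)
        B, N, C = x.shape
        mask = torch.rand(B, N, device=x.device) < self.mask_ratio  # True = masked
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand(B, N, C), x)
        x = self.generator(x)
        loss_gen = self.mse(x[mask], f_t_deep[mask])
        return loss_mimic, loss_gen
```

In training, these two terms would be weighted and added to the usual classification loss (and, per the abstract, can be combined directly with a logit-based KD loss).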
Cite
Text
Yang et al. "ViTKD: Feature-Based Knowledge Distillation for Vision Transformers." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024. doi:10.1109/CVPRW63382.2024.00145

Markdown
[Yang et al. "ViTKD: Feature-Based Knowledge Distillation for Vision Transformers." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024.](https://mlanthology.org/cvprw/2024/yang2024cvprw-vitkd/) doi:10.1109/CVPRW63382.2024.00145

BibTeX
@inproceedings{yang2024cvprw-vitkd,
title = {{ViTKD: Feature-Based Knowledge Distillation for Vision Transformers}},
author = {Yang, Zhendong and Li, Zhe and Zeng, Ailing and Li, Zexian and Yuan, Chun and Li, Yu},
booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
year = {2024},
pages = {1379--1388},
doi = {10.1109/CVPRW63382.2024.00145},
url = {https://mlanthology.org/cvprw/2024/yang2024cvprw-vitkd/}
}