DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers
Abstract
Transformers have been successfully applied to computer vision due to their powerful modeling capacity with self-attention. However, the good performance of transformers heavily relies on enormous amounts of training images. Thus, a data-efficient transformer solution is urgently needed. In this work, we propose an early knowledge distillation framework, termed DearKD, to improve the data efficiency of transformers. DearKD is a two-stage framework that first distills the inductive biases from the early intermediate layers of a CNN and then gives the transformer full play by training it without distillation. Furthermore, DearKD can be applied to the extreme data-free case where no real images are available; for this setting, we propose a boundary-preserving intra-divergence loss based on DeepInversion to further close the performance gap against the full-data counterpart. Extensive experiments on ImageNet, partial ImageNet, the data-free setting and other downstream tasks demonstrate the superiority of DearKD over its baselines and state-of-the-art methods.
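As a rough, hypothetical sketch (not the authors' released code), the two-stage schedule described in the abstract can be read as a training loss that adds a feature-alignment term, matching early transformer features to early CNN features, only during the first stage, and reverts to plain supervised training afterwards. The module names, the linear alignment layer, and the MSE matching term below are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class TwoStageKDLoss(nn.Module):
    """Hypothetical two-stage loss: stage 1 = CE + early-feature distillation, stage 2 = CE only."""

    def __init__(self, student_dim: int, teacher_dim: int, stage1_epochs: int):
        super().__init__()
        # Project student (ViT) features to the CNN teacher's feature dimension.
        self.align = nn.Linear(student_dim, teacher_dim)
        self.ce = nn.CrossEntropyLoss()
        self.mse = nn.MSELoss()
        self.stage1_epochs = stage1_epochs

    def forward(self, logits, labels, student_feat, teacher_feat, epoch):
        loss = self.ce(logits, labels)
        if epoch < self.stage1_epochs:
            # Stage 1: distill inductive biases from early intermediate CNN features.
            loss = loss + self.mse(self.align(student_feat), teacher_feat.detach())
        # Stage 2: the transformer trains without distillation.
        return loss


# Toy usage with random tensors (shapes are illustrative only).
criterion = TwoStageKDLoss(student_dim=384, teacher_dim=256, stage1_epochs=150)
logits = torch.randn(8, 1000)
labels = torch.randint(0, 1000, (8,))
student_feat = torch.randn(8, 196, 384)
teacher_feat = torch.randn(8, 196, 256)
loss = criterion(logits, labels, student_feat, teacher_feat, epoch=10)
```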
Cite
Text
Chen et al. "DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.01174
Markdown
[Chen et al. "DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/chen2022cvpr-dearkd/) doi:10.1109/CVPR52688.2022.01174
BibTeX
@inproceedings{chen2022cvpr-dearkd,
title = {{DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers}},
author = {Chen, Xianing and Cao, Qiong and Zhong, Yujie and Zhang, Jing and Gao, Shenghua and Tao, Dacheng},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2022},
pages = {12052-12062},
doi = {10.1109/CVPR52688.2022.01174},
url = {https://mlanthology.org/cvpr/2022/chen2022cvpr-dearkd/}
}