Generic-to-Specific Distillation of Masked Autoencoders

Abstract

Large vision Transformers (ViTs) driven by self-supervised pre-training mechanisms have achieved unprecedented progress. Lightweight ViT models, however, are limited by their capacity and benefit little from those pre-training mechanisms. Knowledge distillation defines a paradigm to transfer representations from large (teacher) models to small (student) ones. However, conventional single-stage distillation easily gets stuck on task-specific transfer, failing to retain the task-agnostic knowledge crucial for model generalization. In this study, we propose generic-to-specific distillation (G2SD) to tap the potential of small ViT models under the supervision of large models pre-trained by masked autoencoders. In generic distillation, the decoder of the small model is encouraged to align its feature predictions with hidden representations of the large model, so that task-agnostic knowledge can be transferred. In specific distillation, predictions of the small model are constrained to be consistent with those of the large model, transferring the task-specific features that guarantee task performance. With G2SD, the vanilla ViT-Small model achieves 98.7%, 98.1%, and 99.3% of the performance of its teacher (ViT-Base) for image classification, object detection, and semantic segmentation, respectively, setting a solid baseline for two-stage vision distillation. Code will be available at https://github.com/pengzhiliang/G2SD.
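
The abstract only sketches the two distillation stages. Below is a minimal PyTorch sketch of how the two objectives might be instantiated; the function names, the smooth-L1 feature-matching loss, and the temperature-scaled KL loss are illustrative assumptions for exposition, not the paper's verified implementation (see the released code for the actual losses).

import torch
import torch.nn.functional as F

def generic_distillation_loss(student_decoder_preds, teacher_hidden, mask):
    # Generic (task-agnostic) stage, sketched: align the student decoder's
    # feature predictions with the teacher's hidden representations at the
    # masked-patch positions, as in masked image modeling.
    # Loss choice (smooth L1 on layer-normalized targets) is an assumption.
    target = F.layer_norm(teacher_hidden, teacher_hidden.shape[-1:])
    return F.smooth_l1_loss(student_decoder_preds[mask], target[mask])

def specific_distillation_loss(student_logits, teacher_logits, tau=1.0):
    # Specific (task-specific) stage, sketched: constrain the student's task
    # predictions to match the teacher's via soft-label KL distillation.
    p_teacher = F.softmax(teacher_logits / tau, dim=-1)
    log_p_student = F.log_softmax(student_logits / tau, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * tau ** 2

# Toy usage with random tensors (shapes are illustrative only).
B, N, D, C = 2, 196, 768, 1000
preds = torch.randn(B, N, D)            # student decoder feature predictions
hidden = torch.randn(B, N, D)           # teacher hidden representations
mask = torch.rand(B, N) > 0.25          # masked-patch positions
l_generic = generic_distillation_loss(preds, hidden, mask)
l_specific = specific_distillation_loss(torch.randn(B, C), torch.randn(B, C))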

Cite

Text

Huang et al. "Generic-to-Specific Distillation of Masked Autoencoders." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.01535

Markdown

[Huang et al. "Generic-to-Specific Distillation of Masked Autoencoders." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/huang2023cvpr-generictospecific/) doi:10.1109/CVPR52729.2023.01535

BibTeX

@inproceedings{huang2023cvpr-generictospecific,
  title     = {{Generic-to-Specific Distillation of Masked Autoencoders}},
  author    = {Huang, Wei and Peng, Zhiliang and Dong, Li and Wei, Furu and Jiao, Jianbin and Ye, Qixiang},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2023},
  pages     = {15996-16005},
  doi       = {10.1109/CVPR52729.2023.01535},
  url       = {https://mlanthology.org/cvpr/2023/huang2023cvpr-generictospecific/}
}