Spatial-Channel Token Distillation for Vision MLPs

Abstract

Recently, neural architectures built entirely of Multi-Layer Perceptrons (MLPs) have attracted great research interest from the computer vision community. However, inefficient mixing of spatial and channel information causes MLP-like vision models to demand tremendous pre-training on large-scale datasets. This work addresses the problem from a novel knowledge distillation perspective. We propose Spatial-channel Token Distillation (STD), which improves information mixing in the two dimensions by introducing a distillation token to each of them. A mutual-information regularization is further introduced so that each distillation token focuses on its specific dimension, maximizing the performance gain. Extensive experiments on ImageNet with several MLP-like architectures demonstrate that the proposed token distillation mechanism efficiently improves accuracy. For example, STD boosts the top-1 accuracy of Mixer-S16 on ImageNet from 73.8% to 75.7% without any costly pre-training on JFT-300M. When applied to stronger architectures, e.g. CycleMLP-B1 and CycleMLP-B2, STD still harvests about 1.1% and 0.5% accuracy gains, respectively.
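To make the core idea concrete, the following is a minimal shape-level sketch (not the authors' code) of where the two distillation tokens live in an MLP-Mixer-style token matrix: the spatial token is an extra row mixed alongside the patch tokens, and the channel token is an extra column mixed alongside the feature channels. All sizes and variable names here are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch only: an MLP-Mixer block operates on a token matrix
# X with S spatial tokens (image patches), each having C channels.
S, C = 196, 512                      # assumed sizes, e.g. 14x14 patches
X = np.zeros((S, C))

# Spatial distillation token: one extra row, processed by the token-mixing
# MLP together with the patch tokens, so it aggregates spatial information.
spatial_token = np.random.randn(1, C)
X_spatial = np.concatenate([X, spatial_token], axis=0)       # (S+1, C)

# Channel distillation token: one extra column, processed by the
# channel-mixing MLP together with the features, so it aggregates
# channel information.
channel_token = np.random.randn(S + 1, 1)
X_both = np.concatenate([X_spatial, channel_token], axis=1)  # (S+1, C+1)

print(X_both.shape)  # (197, 513)
```

In the paper's framing, the teacher's supervision is applied to these two tokens, and the mutual-information regularizer discourages the spatial and channel tokens from capturing redundant information.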

Cite

Text

Li et al. "Spatial-Channel Token Distillation for Vision MLPs." International Conference on Machine Learning, 2022.

Markdown

[Li et al. "Spatial-Channel Token Distillation for Vision MLPs." International Conference on Machine Learning, 2022.](https://mlanthology.org/icml/2022/li2022icml-spatialchannel/)

BibTeX

@inproceedings{li2022icml-spatialchannel,
  title     = {{Spatial-Channel Token Distillation for Vision MLPs}},
  author    = {Li, Yanxi and Chen, Xinghao and Dong, Minjing and Tang, Yehui and Wang, Yunhe and Xu, Chang},
  booktitle = {International Conference on Machine Learning},
  year      = {2022},
  pages     = {12685--12695},
  volume    = {162},
  url       = {https://mlanthology.org/icml/2022/li2022icml-spatialchannel/}
}