MLP-Mixer: An All-MLP Architecture for Vision

Abstract

Convolutional Neural Networks (CNNs) are the go-to model for computer vision. Recently, attention-based networks, such as the Vision Transformer, have also become popular. In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them is necessary. We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs). MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (i.e., "mixing" the per-location features), and one with MLPs applied across patches (i.e., "mixing" spatial information). When trained on large datasets, or with modern regularization schemes, MLP-Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models. We hope that these results spark further research beyond the realms of well-established CNNs and Transformers.
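
The two layer types described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the helper names (`gelu`, `mlp`, `init_mlp`, `mixer_block`), the sizes, the GELU nonlinearity, and the skip connections are assumptions made for illustration, and normalization layers are omitted. The input is assumed to be a table of shape `(num_patches, channels)`, one row per image patch.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of the GELU nonlinearity (an assumed choice).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, w1, b1, w2, b2):
    # Two-layer perceptron acting on the last axis of x.
    return gelu(x @ w1 + b1) @ w2 + b2

def init_mlp(rng, dim, hidden):
    # Hypothetical initializer: small random weights, zero biases.
    return (rng.normal(0.0, 0.02, (dim, hidden)), np.zeros(hidden),
            rng.normal(0.0, 0.02, (hidden, dim)), np.zeros(dim))

def mixer_block(x, token_params, channel_params):
    # x has shape (num_patches, channels).
    # Across-patch MLP: transpose so the MLP acts along the patch axis,
    # mixing spatial information separately for every channel.
    x = x + mlp(x.T, *token_params).T
    # Per-patch MLP: acts along the channel axis, so every patch is
    # processed independently with the same weights.
    x = x + mlp(x, *channel_params)
    return x

rng = np.random.default_rng(0)
num_patches, channels, hidden = 16, 8, 32  # illustrative sizes only
x = rng.normal(size=(num_patches, channels))
token_params = init_mlp(rng, num_patches, hidden)
channel_params = init_mlp(rng, channels, hidden)
y = mixer_block(x, token_params, channel_params)
print(y.shape)  # (16, 8)
```

In this sketch both MLPs are ordinary dense layers; the only difference between the two layer types is the axis they act on, which is selected by transposing the patch table.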

Cite

Text

Tolstikhin et al. "MLP-Mixer: An All-MLP Architecture for Vision." Neural Information Processing Systems, 2021.

Markdown

[Tolstikhin et al. "MLP-Mixer: An All-MLP Architecture for Vision." Neural Information Processing Systems, 2021.](https://mlanthology.org/neurips/2021/tolstikhin2021neurips-mlpmixer/)

BibTeX

@inproceedings{tolstikhin2021neurips-mlpmixer,
  title     = {{MLP-Mixer: An All-MLP Architecture for Vision}},
  author    = {Tolstikhin, Ilya O and Houlsby, Neil and Kolesnikov, Alexander and Beyer, Lucas and Zhai, Xiaohua and Unterthiner, Thomas and Yung, Jessica and Steiner, Andreas and Keysers, Daniel and Uszkoreit, Jakob and Lucic, Mario and Dosovitskiy, Alexey},
  booktitle = {Neural Information Processing Systems},
  year      = {2021},
  url       = {https://mlanthology.org/neurips/2021/tolstikhin2021neurips-mlpmixer/}
}