SPFormer: Enhancing Vision Transformer with Superpixel Representation

Abstract

This work introduces SPFormer, a novel Vision Transformer architecture enhanced by superpixel representation. Addressing the limitations of traditional Vision Transformers' fixed-size, non-adaptive patch partitioning, SPFormer divides the input image into irregular, semantically coherent regions (i.e., superpixels), effectively capturing intricate details. Notably, this is also applicable to intermediate features, and the whole model supports end-to-end training, empirically yielding superior performance across multiple benchmarks. For example, on the challenging ImageNet benchmark, SPFormer outperforms DeiT by 1.4% at the tiny-model size and by 1.1% at the small-model size. Moreover, a standout feature of SPFormer is its inherent explainability — the superpixel structure offers a window into the model's internal processes, providing valuable insights that enhance the model's interpretability and strengthen its robustness against challenging scenarios like image rotations and occlusions.
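The core idea — replacing fixed square patches with tokens pooled over irregular, semantically coherent regions — can be illustrated with a minimal sketch. This is a hypothetical illustration, not the authors' implementation: it assumes a precomputed superpixel assignment per pixel, mean-pools pixel features into one token per superpixel, and scatters tokens back to pixels (the step that lets the same operation apply to intermediate feature maps).

```python
import numpy as np

def superpixel_pool(features, assignment, num_superpixels):
    """Average pixel features within each superpixel to form region tokens.

    features:    (H*W, C) array of per-pixel features
    assignment:  (H*W,) array of superpixel ids in [0, num_superpixels)
    returns:     (num_superpixels, C) array of superpixel tokens
    """
    C = features.shape[1]
    tokens = np.zeros((num_superpixels, C))
    counts = np.zeros(num_superpixels)
    # Unbuffered scatter-add: accumulate features and member counts per region.
    np.add.at(tokens, assignment, features)
    np.add.at(counts, assignment, 1.0)
    return tokens / np.maximum(counts, 1.0)[:, None]

def superpixel_unpool(tokens, assignment):
    """Broadcast each superpixel token back to its member pixels."""
    return tokens[assignment]

# Toy example: 6 pixels with 2-dim features, grouped into 2 superpixels.
feats = np.arange(12, dtype=float).reshape(6, 2)
assign = np.array([0, 0, 1, 1, 1, 0])
tok = superpixel_pool(feats, assign, 2)      # -> [[4., 5.], [6., 7.]]
pix = superpixel_unpool(tok, assign)         # per-pixel reconstruction
```

In the full model the assignment itself would be predicted and refined jointly with the features, which is what makes end-to-end training possible; the pooled tokens then play the role that fixed patch embeddings play in a standard ViT.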

Cite

Text

Mei et al. "SPFormer: Enhancing Vision Transformer with Superpixel Representation." Transactions on Machine Learning Research, 2025.

Markdown

[Mei et al. "SPFormer: Enhancing Vision Transformer with Superpixel Representation." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/mei2025tmlr-spformer/)

BibTeX

@article{mei2025tmlr-spformer,
  title     = {{SPFormer: Enhancing Vision Transformer with Superpixel Representation}},
  author    = {Mei, Jieru and Chen, Liang-Chieh and Yuille, Alan and Xie, Cihang},
  journal   = {Transactions on Machine Learning Research},
  year      = {2025},
  url       = {https://mlanthology.org/tmlr/2025/mei2025tmlr-spformer/}
}