Training-Free Visual Token Compression via Delayed Spatial Merging

Abstract

Token compression is an emerging paradigm that accelerates the inference of Vision Transformers (ViTs) without any retraining or fine-tuning. To push the frontier of training-free acceleration in ViTs, we improve token merging by adding the perspectives of 1) activation outliers and 2) hierarchical representations. Through a careful analysis of the attention behavior in ViTs, we characterize a delayed onset of the convergent attention phenomenon, which makes token merging undesirable in the bottom blocks of ViTs. Moreover, we augment token merging with a hierarchical processing scheme to capture multi-scale redundancy between visual tokens. Combining these two insights, we build a unified inference framework called DSM: Delayed Spatial Merging. We extensively evaluate DSM on various ViT model scales (Tiny to Huge) and tasks (ImageNet-1k and transfer learning), achieving up to 1.8$\times$ FLOP reduction and 1.6$\times$ throughput speedup at a negligible loss while being two orders of magnitude faster than existing methods.
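The "delayed" part of the idea can be illustrated with a minimal sketch of a per-block token schedule. This is not the DSM implementation: the helper name `delayed_merge_schedule`, the fixed number of tokens merged per block, and the example values (197 tokens, 12 blocks, 16 merges per block, a delay of 4 blocks) are all assumptions chosen for illustration. It only shows how skipping merging in the bottom blocks changes the number of tokens each block processes.

```python
def delayed_merge_schedule(num_tokens, num_blocks, merge_per_block, delay):
    """Return the number of tokens entering each transformer block.

    Hypothetical helper (not from the paper): the first `delay` blocks keep
    every token, and each later block removes `merge_per_block` tokens by
    merging, never dropping below a single token.
    """
    if not 0 <= delay <= num_blocks:
        raise ValueError("delay must be between 0 and num_blocks")
    schedule = []
    tokens = num_tokens
    for block in range(num_blocks):
        schedule.append(tokens)  # tokens seen by this block
        if block >= delay:       # merging only starts after the delay
            tokens = max(1, tokens - merge_per_block)
    return schedule


# Assumed example values: 197 tokens (196 patches plus a class token), 12 blocks.
immediate = delayed_merge_schedule(197, 12, 16, delay=0)
delayed = delayed_merge_schedule(197, 12, 16, delay=4)
print(immediate)
print(delayed)
```

With a delay, the bottom blocks process the full token set and compression is concentrated in the later blocks, so the same per-block merge rate yields a smaller overall reduction. How the delay and merge rate are actually chosen in DSM is not stated in the abstract.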

Cite

Text

Heo et al. "Training-Free Visual Token Compression via Delayed Spatial Merging." NeurIPS 2024 Workshops: Compression, 2024.

Markdown

[Heo et al. "Training-Free Visual Token Compression via Delayed Spatial Merging." NeurIPS 2024 Workshops: Compression, 2024.](https://mlanthology.org/neuripsw/2024/heo2024neuripsw-trainingfree/)

BibTeX

@inproceedings{heo2024neuripsw-trainingfree,
  title     = {{Training-Free Visual Token Compression via Delayed Spatial Merging}},
  author    = {Heo, Jung Hwan and Azizi, Seyedarmin and Fayyazi, Arash and Pedram, Massoud},
  booktitle = {NeurIPS 2024 Workshops: Compression},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/heo2024neuripsw-trainingfree/}
}