LPViT: Low-Power Semi-Structured Pruning for Vision Transformers

Abstract

Vision transformers (ViTs) have emerged as a promising alternative to convolutional neural networks (CNNs) for various image analysis tasks, offering comparable or superior performance. However, one significant drawback of ViTs is their resource-intensive nature, leading to increased memory footprint, computational complexity, and power consumption. To democratize this high-performance technology and make it more environmentally friendly, it is essential to compress ViT models, reducing their resource requirements while maintaining high performance. In this paper, we introduce a new block-structured pruning technique to address the resource-intensive nature of ViTs, offering a balanced trade-off between accuracy and hardware acceleration. Unlike unstructured pruning or channel-wise structured pruning, block pruning leverages the block-wise structure of linear layers, resulting in more efficient matrix multiplications. To optimize this pruning scheme, we propose a novel hardware-aware learning objective, tailored to the block sparsity structure, that simultaneously maximizes speedup and minimizes power consumption during inference. This objective eliminates the need for empirical look-up tables and focuses solely on reducing parametrized layer connections. Moreover, we provide a lightweight algorithm for post-training pruning of ViTs, using a second-order Taylor approximation and empirical optimization to solve the proposed hardware-aware objective. Extensive experiments on ImageNet across various ViT architectures, including DeiT-B and DeiT-S, demonstrate performance competitive with other pruning methods and a remarkable balance between accuracy preservation and power savings. In particular, we achieve up to 3.93× and 1.79× speedups on dedicated hardware and GPUs, respectively, for DeiT-B, and observe a 1.4× reduction in inference power on real-world GPUs. Code will be released soon.
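
The core mechanism described above is pruning a linear layer's weight matrix in whole blocks rather than individual weights, so the surviving weights form dense tiles that map onto block-sparse matrix multiplication. The snippet below is a minimal sketch of that idea in PyTorch, assuming a simple L2 tile norm as a stand-in for the paper's second-order Taylor importance criterion; the helper name block_prune_linear, the block size, and the sparsity target are illustrative choices, not the authors' implementation.

# Minimal sketch of block-wise (semi-structured) pruning for one linear layer.
# Assumes plain PyTorch; the block size, sparsity target, and L2 tile score are
# illustrative stand-ins for the paper's second-order Taylor criterion.
import torch
import torch.nn as nn

def block_prune_linear(layer: nn.Linear, block: int = 16, sparsity: float = 0.5) -> None:
    """Zero out entire (block x block) tiles of layer.weight in place."""
    w = layer.weight.data
    out_f, in_f = w.shape
    assert out_f % block == 0 and in_f % block == 0, "dims must be divisible by the block size"

    # View the weight matrix as a grid of tiles and score each tile by its L2 norm.
    tiles = w.reshape(out_f // block, block, in_f // block, block)
    scores = tiles.pow(2).sum(dim=(1, 3)).sqrt()      # shape: (out_f/block, in_f/block)

    # Keep the highest-scoring tiles and zero the rest.
    keep = max(1, int(round(scores.numel() * (1.0 - sparsity))))
    threshold = torch.topk(scores.flatten(), keep).values.min()
    mask = (scores >= threshold).float()              # 1 = keep tile, 0 = prune tile
    w *= mask.repeat_interleave(block, dim=0).repeat_interleave(block, dim=1)

layer = nn.Linear(768, 3072)                          # typical DeiT-B MLP dimensions
block_prune_linear(layer, block=16, sparsity=0.5)
print((layer.weight == 0).float().mean())             # roughly 0.5, removed in whole tiles

Because the zeros are removed in contiguous tiles rather than scattered positions, a block-sparse GEMM kernel can skip whole tiles of work, which is the structural property the abstract credits for the reported speedups and power savings on dedicated hardware and GPUs.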

Cite

Text

Xu et al. "LPViT: Low-Power Semi-Structured Pruning for Vision Transformers." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-73209-6_16

Markdown

[Xu et al. "LPViT: Low-Power Semi-Structured Pruning for Vision Transformers." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/xu2024eccv-lpvit/) doi:10.1007/978-3-031-73209-6_16

BibTeX

@inproceedings{xu2024eccv-lpvit,
  title     = {{LPViT: Low-Power Semi-Structured Pruning for Vision Transformers}},
  author    = {Xu, Kaixin and Wang, Zhe and Chen, Chunyun and Geng, Xue and Lin, Jie and Yang, Xulei and Wu, Min and Li, Xiaoli and Lin, Weisi},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-73209-6_16},
  url       = {https://mlanthology.org/eccv/2024/xu2024eccv-lpvit/}
}