Turbo: Informativity-Driven Acceleration Plug-in for Vision-Language Large Models

Abstract

Vision-Language Large Models (VLMs) have recently become a primary backbone of AI, owing to their impressive performance. However, their high computational costs, i.e., low throughput and high latency, limit their potential in real-world scenarios. To accelerate VLMs, most existing methods focus on the model perspective: pruning, distillation, and quantization, but completely overlook redundancy from the data perspective. To fill this gap, this paper highlights the severity of data redundancy and designs a plug-and-play Turbo module, guided by information degree, to prune inefficient tokens from visual or textual data. In pursuit of an efficiency-performance trade-off, information degree takes two crucial factors into consideration: mutual redundancy and semantic value. Concretely, the former evaluates data duplication between sequential tokens, while the latter evaluates each token by its contribution to the overall semantics. As a result, tokens with a high information degree carry less redundancy and stronger semantics. During VLM computation, Turbo works as a user-friendly plug-in that sorts tokens by information degree, utilizing only the top-ranked ones to save costs. Its advantages are multifaceted: e.g., it is broadly compatible with various VLMs across understanding and generation, and simple to use without re-training or non-trivial engineering effort. Extensive experiments on multiple VLM benchmarks demonstrate that Turbo achieves strong acceleration with a negligible performance drop.
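The scoring-and-pruning idea in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names (`information_degree`, `turbo_prune`), the weight `alpha`, and the specific choices (maximum cosine similarity to another token as mutual redundancy, attention from a summary/[CLS] token as semantic value) are assumptions made for the sake of a concrete example.

```python
import numpy as np

def information_degree(tokens, cls_attn, alpha=1.0):
    """Score each token: high = low redundancy + strong semantics.

    tokens:   (N, D) array of token embeddings
    cls_attn: (N,) attention weights from a summary/[CLS] token,
              used here as a proxy for semantic value
    alpha:    hypothetical weight balancing the two terms
    """
    # Cosine similarity between all pairs of tokens.
    norm = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = norm @ norm.T
    np.fill_diagonal(sim, -np.inf)  # ignore self-similarity
    # Mutual redundancy: similarity to the most similar other token.
    redundancy = sim.max(axis=1)
    # Information degree: reward semantic value, penalize redundancy.
    return alpha * cls_attn - redundancy

def turbo_prune(tokens, cls_attn, keep_ratio=0.5, alpha=1.0):
    """Keep only the top-ranked tokens by information degree."""
    scores = information_degree(tokens, cls_attn, alpha)
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.argsort(scores)[-k:]  # indices of the top-k tokens
    return tokens[np.sort(keep)]    # preserve original token order
```

Dropping the bottom-ranked tokens shortens the sequence that every subsequent attention layer must process, which is where the throughput and latency savings come from.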

Cite

Text

Ju et al. "Turbo: Informativity-Driven Acceleration Plug-in for Vision-Language Large Models." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72952-2_25

Markdown

[Ju et al. "Turbo: Informativity-Driven Acceleration Plug-in for Vision-Language Large Models." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/ju2024eccv-turbo/) doi:10.1007/978-3-031-72952-2_25

BibTeX

@inproceedings{ju2024eccv-turbo,
  title     = {{Turbo: Informativity-Driven Acceleration Plug-in for Vision-Language Large Models}},
  author    = {Ju, Chen and Wang, Haicheng and Cheng, Haozhe and Chen, Xu and Zhai, Zhonghua and Huang, Weilin and Lan, Jinsong and Xiao, Shuai and Zheng, Bo},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-72952-2_25},
  url       = {https://mlanthology.org/eccv/2024/ju2024eccv-turbo/}
}