FlexAttention for Efficient High-Resolution Vision-Language Models

Abstract

Current high-resolution vision-language models encode images as high-resolution image tokens and exhaustively use all of these tokens to compute attention, which significantly increases the computational cost. To address this problem, we propose FlexAttention, a flexible attention mechanism for efficient high-resolution vision-language models. Specifically, a high-resolution image is encoded both as high-resolution tokens and as low-resolution tokens, and only the low-resolution tokens and a few selected high-resolution tokens are used to calculate the attention map, which greatly shrinks the computational cost. The high-resolution tokens are selected via a high-resolution selection module that retrieves tokens of relevant regions based on an input attention map. The selected high-resolution tokens are then concatenated with the low-resolution tokens and text tokens and fed into a hierarchical self-attention layer, which produces an attention map that is used for the next step of high-resolution token selection. The hierarchical self-attention and high-resolution token selection are performed iteratively at each attention layer. Experiments on multimodal benchmarks show that FlexAttention outperforms existing high-resolution VLMs (e.g., by a relative ∼9% on V* Bench and ∼7% on TextVQA), while also reducing the computational cost by nearly 40%. Project page: https://vis-www.cs.umass.edu/flexattention
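The abstract describes an iterative loop: select high-resolution tokens using the previous layer's attention map, attend over the concatenation of text, low-resolution, and selected high-resolution tokens, then reuse the resulting attention map for the next selection. Below is a minimal PyTorch sketch of that loop. The FlexAttentionSketch class, the top-k selection rule, the uniform initial attention map, and all tensor shapes are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class FlexAttentionSketch(nn.Module):
    """Sketch of the select-then-attend loop described in the abstract.
    Shapes, the top-k rule, and the module layout are assumptions."""

    def __init__(self, dim, num_layers=2, top_k=4, heads=4):
        super().__init__()
        self.top_k = top_k
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True)
            for _ in range(num_layers)
        )

    def forward(self, text, low_res, high_res):
        # text: (B, T, D), low_res: (B, N, D),
        # high_res: (B, N, R, D) -- R high-res tokens per low-res region.
        B, N, R, D = high_res.shape
        T = text.shape[1]
        # Start from a uniform attention map over regions (assumption).
        attn_map = low_res.new_full((B, N), 1.0 / N)
        for layer in self.layers:
            # 1) High-resolution selection: gather tokens from the
            #    top-k most-attended low-resolution regions.
            idx = attn_map.topk(self.top_k, dim=-1).indices         # (B, k)
            gidx = idx[:, :, None, None].expand(-1, -1, R, D)
            selected = high_res.gather(1, gidx).reshape(B, -1, D)   # (B, k*R, D)
            # 2) Self-attention over [text | low-res | selected high-res].
            seq = torch.cat([text, low_res, selected], dim=1)
            out, weights = layer(seq, seq, seq, need_weights=True)
            # 3) The attention that text tokens pay to low-res tokens
            #    drives the next round of selection.
            attn_map = weights[:, :T, T:T + N].mean(dim=1)          # (B, N)
            text, low_res = out[:, :T], out[:, T:T + N]
        return text

# Example: 8 text tokens, a 4x4 low-res grid, 4 high-res tokens per region.
B, T, N, R, D = 1, 8, 16, 4, 64
model = FlexAttentionSketch(dim=D)
out = model(torch.randn(B, T, D), torch.randn(B, N, D), torch.randn(B, N, R, D))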

Cite

Text

Li et al. "FlexAttention for Efficient High-Resolution Vision-Language Models." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72698-9_17

Markdown

[Li et al. "FlexAttention for Efficient High-Resolution Vision-Language Models." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/li2024eccv-flexattention/) doi:10.1007/978-3-031-72698-9_17

BibTeX

@inproceedings{li2024eccv-flexattention,
  title     = {{FlexAttention for Efficient High-Resolution Vision-Language Models}},
  author    = {Li, Junyan and Chen, Delin and Cai, Tianle and Chen, Peihao and Hong, Yining and Chen, Zhenfang and Shen, Yikang and Gan, Chuang},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-72698-9_17},
  url       = {https://mlanthology.org/eccv/2024/li2024eccv-flexattention/}
}