ViG: Linear-Complexity Visual Sequence Learning with Gated Linear Attention

Abstract

Recently, linear-complexity sequence modeling networks have achieved modeling capabilities comparable to those of Vision Transformers on a variety of computer vision tasks, while using fewer FLOPs and less memory. However, their advantage in actual runtime speed is not significant. To address this issue, we introduce Gated Linear Attention (GLA) for vision, leveraging its superior hardware-awareness and efficiency. We propose direction-wise gating to capture 1D global context through bidirectional modeling, and 2D gating locality injection to adaptively inject 2D local details into the 1D global context. Our hardware-aware implementation further merges the forward and backward scans into a single kernel, enhancing parallelism and reducing memory cost and latency. The proposed model, ViG, offers a favorable trade-off in accuracy, parameters, and FLOPs on ImageNet and downstream tasks, outperforming popular Transformer- and CNN-based models.
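To make the abstract's mechanism concrete, here is a minimal, naive sketch of a gated linear attention scan and one plausible way to combine forward and backward directions. This is an illustrative reading of the idea, not the paper's implementation: the function names, the per-key decay gates `alpha`, and the summation of the two directional outputs are assumptions, and the paper's actual hardware-aware kernel fuses both scans for efficiency rather than running two Python loops.

```python
import numpy as np

def gla_scan(q, k, v, alpha):
    """Naive gated linear attention scan (linear in sequence length T).

    q, k: (T, d_k); v: (T, d_v); alpha: (T, d_k) data-dependent decay
    gates in (0, 1). The hidden state S is a (d_k, d_v) matrix updated
    recurrently, so no T x T attention matrix is ever formed.
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.empty((T, d_v))
    for t in range(T):
        # Decay the state per key dimension, then add the new
        # key-value outer product.
        S = alpha[t][:, None] * S + np.outer(k[t], v[t])
        out[t] = q[t] @ S
    return out

def bidirectional_gla(q, k, v, alpha_fwd, alpha_bwd):
    """Direction-wise gating: separate gates per scan direction, with the
    two directional outputs summed (one simple fusion choice; the paper
    instead merges both scans into a single kernel)."""
    fwd = gla_scan(q, k, v, alpha_fwd)
    bwd = gla_scan(q[::-1], k[::-1], v[::-1], alpha_bwd[::-1])[::-1]
    return fwd + bwd
```

With all gates fixed at 1, `gla_scan` reduces to plain (unnormalized) linear attention, which is one way to see why data-dependent gates matter: they let the model forget stale context per key dimension instead of accumulating everything.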

Cite

Text

Liao et al. "ViG: Linear-Complexity Visual Sequence Learning with Gated Linear Attention." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I5.32550

Markdown

[Liao et al. "ViG: Linear-Complexity Visual Sequence Learning with Gated Linear Attention." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/liao2025aaai-vig/) doi:10.1609/AAAI.V39I5.32550

BibTeX

@inproceedings{liao2025aaai-vig,
  title     = {{ViG: Linear-Complexity Visual Sequence Learning with Gated Linear Attention}},
  author    = {Liao, Bencheng and Wang, Xinggang and Zhu, Lianghui and Zhang, Qian and Huang, Chang},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {5182--5190},
  doi       = {10.1609/AAAI.V39I5.32550},
  url       = {https://mlanthology.org/aaai/2025/liao2025aaai-vig/}
}