Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-like Architectures
Abstract
Transformers have revolutionized computer vision and natural language processing, but their high computational complexity limits their application in high-resolution image processing and long-context analysis. This paper introduces Vision-RWKV (VRWKV), a model that builds upon the RWKV architecture from the NLP field with key modifications tailored specifically for vision tasks. Similar to the Vision Transformer (ViT), our model demonstrates robust global processing capabilities, efficiently handles sparse inputs like masked images, and can scale up to accommodate both large-scale parameters and extensive datasets. Its distinctive advantage is its reduced spatial aggregation complexity, enabling seamless processing of high-resolution images without the need for window operations. Our evaluations demonstrate that VRWKV surpasses ViT's performance in image classification, with significantly faster speeds and lower memory usage when processing high-resolution inputs. In dense prediction tasks, it outperforms window-based models while maintaining comparable speed. These results highlight VRWKV's potential as a more efficient alternative for visual perception tasks. Code and models are available at~\url{https://github.com/OpenGVLab/Vision-RWKV}.
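To illustrate the reduced spatial aggregation complexity mentioned above, the sketch below implements a bidirectional RWKV-style (Bi-WKV) token aggregation with a distance-based exponential decay over all token pairs. This is a naive quadratic-time reference written for clarity under our own assumptions about the formula (per-channel decay `w`, current-token bonus `u`); the actual VRWKV implementation evaluates the same aggregation in linear time and may differ in details.

```python
import numpy as np

def bi_wkv(k, v, w, u):
    """Naive O(T^2) sketch of a bidirectional WKV aggregation.

    k, v : (T, C) arrays of keys and values for T tokens, C channels.
    w, u : (C,) per-channel decay and current-token bonus parameters.
    Each output token is a softmax-like weighted average of all values,
    with weights decayed by token distance |t - i|.
    """
    T, C = k.shape
    out = np.empty((T, C))
    for t in range(T):
        # Current token contributes with the bonus u instead of decay.
        num = np.exp(u + k[t]) * v[t]
        den = np.exp(u + k[t])
        for i in range(T):
            if i == t:
                continue
            decay = -(abs(t - i) - 1) / T * w  # distance-based decay
            num += np.exp(decay + k[i]) * v[i]
            den += np.exp(decay + k[i])
        out[t] = num / den  # normalized weighted average of values
    return out
```

Because every token attends to every other token with only elementwise exponentials (no T-by-T attention matrix is materialized in the linear-time form), this style of mixing keeps global receptive fields while avoiding windowing.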
Cite
Text
Duan et al. "Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-like Architectures." International Conference on Learning Representations, 2025.
Markdown
[Duan et al. "Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-like Architectures." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/duan2025iclr-visionrwkv/)
BibTeX
@inproceedings{duan2025iclr-visionrwkv,
title = {{Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-like Architectures}},
author = {Duan, Yuchen and Wang, Weiyun and Chen, Zhe and Zhu, Xizhou and Lu, Lewei and Lu, Tong and Qiao, Yu and Li, Hongsheng and Dai, Jifeng and Wang, Wenhai},
booktitle = {International Conference on Learning Representations},
year = {2025},
url = {https://mlanthology.org/iclr/2025/duan2025iclr-visionrwkv/}
}