Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models
Abstract
Most Large Vision-Language Models (LVLMs) share the same vision vocabulary, i.e., CLIP, for common vision tasks. However, for special tasks that need dense and fine-grained perception, the CLIP-style vocabulary may be inefficient at tokenizing the corresponding vision knowledge and may even suffer from out-of-vocabulary problems. Accordingly, we propose Vary, an efficient and productive method to scale up the Vision vocabulary of LVLMs. Vary naturally proceeds in two stages: the generation and the integration of a new vision vocabulary. In the first stage, we devise a vocabulary network along with a tiny decoder-only transformer to compress rich vision signals. Next, we scale up the vanilla vision vocabulary by merging the new vocabulary with the original (CLIP) one, enabling LVLMs to acquire new features effectively. We present frameworks at two sizes: Vary-base (7B) and Vary-toy (1.8B), both of which achieve excellent fine-grained perception performance while maintaining strong general ability.
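A minimal sketch of the vocabulary-merging idea described above, not the authors' released implementation: a newly trained vision vocabulary runs in parallel with the frozen CLIP vocabulary, and their token sequences are projected into the LLM embedding space and concatenated. The encoder stand-ins, feature dimensions, and the choice of concatenation axis here are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class MergedVisionVocabulary(nn.Module):
    """Hypothetical merge of a new vision vocabulary with the CLIP one."""

    def __init__(self, clip_dim=1024, new_dim=768, llm_dim=4096):
        super().__init__()
        # Placeholders for the two vocabulary networks; in the paper these
        # are full vision encoders (CLIP plus the newly generated vocabulary).
        self.clip_encoder = nn.Linear(clip_dim, clip_dim)
        self.new_encoder = nn.Linear(new_dim, new_dim)
        # Separate projections map each vocabulary into the LLM token space.
        self.clip_proj = nn.Linear(clip_dim, llm_dim)
        self.new_proj = nn.Linear(new_dim, llm_dim)

    def forward(self, clip_feats, new_feats):
        # clip_feats: (B, N, clip_dim); new_feats: (B, N, new_dim)
        clip_tokens = self.clip_proj(self.clip_encoder(clip_feats))
        new_tokens = self.new_proj(self.new_encoder(new_feats))
        # Merge the two vocabularies; concatenating along the token axis
        # is one plausible choice (a channel-wise merge is another).
        return torch.cat([clip_tokens, new_tokens], dim=1)


if __name__ == "__main__":
    model = MergedVisionVocabulary()
    clip_feats = torch.randn(2, 256, 1024)
    new_feats = torch.randn(2, 256, 768)
    print(model(clip_feats, new_feats).shape)  # torch.Size([2, 512, 4096])
```

The key design point the sketch captures is that CLIP is kept intact while the new vocabulary is bolted on beside it, so general ability is preserved while fine-grained features are added.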
Cite
Text
Wei et al. "Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-73235-5_23

Markdown

[Wei et al. "Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/wei2024eccv-vary/) doi:10.1007/978-3-031-73235-5_23

BibTeX
@inproceedings{wei2024eccv-vary,
title = {{Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models}},
author = {Wei, Haoran and Kong, Lingyu and Chen, Jinyue and Zhao, Liang and Ge, Zheng and Yang, Jinrong and Sun, Jianjian and Han, Chunrui and Zhang, Xiangyu},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2024},
doi = {10.1007/978-3-031-73235-5_23},
url = {https://mlanthology.org/eccv/2024/wei2024eccv-vary/}
}