Tag2Text: Guiding Vision-Language Model via Image Tagging
Abstract
This paper presents Tag2Text, a vision-language pre-training (VLP) framework that introduces image tagging into vision-language models to guide the learning of visual-linguistic features. In contrast to prior works that utilize object tags either manually labeled or automatically detected with a limited detector, our approach utilizes tags parsed from the paired text to learn an image tagger while simultaneously providing guidance to the vision-language model. As a result, Tag2Text can utilize large-scale, annotation-free image tags aligned with image-text pairs, and provides more diverse tag categories beyond objects. Strikingly, Tag2Text showcases the ability of a foundational image tagging model, with superior zero-shot performance comparable even to fully supervised models. Moreover, by leveraging tagging guidance, Tag2Text effectively enhances the performance of vision-language models on both generation-based and alignment-based tasks. Across a wide range of downstream benchmarks, Tag2Text achieves state-of-the-art results with similar model sizes and data scales, demonstrating the efficacy of the proposed tagging guidance.
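As an illustration of the annotation-free supervision described in the abstract, the sketch below shows one simple way tags could be parsed from a paired caption by matching tokens against a tag vocabulary. The vocabulary, function name, and matching rule are illustrative assumptions, not the paper's exact parsing pipeline.

```python
import re

# Hypothetical tag vocabulary; Tag2Text derives its categories from frequently
# occurring tags in captions, so this short list is only illustrative.
TAG_VOCABULARY = {"dog", "frisbee", "grass", "person", "park", "running"}

def parse_tags(caption: str, vocabulary: set = TAG_VOCABULARY) -> list:
    """Return the vocabulary tags mentioned in a caption (annotation-free labels)."""
    tokens = re.findall(r"[a-z]+", caption.lower())
    return sorted(set(tokens) & vocabulary)

# Example: the parsed tags would serve as multi-label targets for the image tagger.
print(parse_tags("A dog is running on the grass to catch a frisbee."))
# -> ['dog', 'frisbee', 'grass', 'running']
```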
Cite
Text
Huang et al. "Tag2Text: Guiding Vision-Language Model via Image Tagging." International Conference on Learning Representations, 2024.
Markdown
[Huang et al. "Tag2Text: Guiding Vision-Language Model via Image Tagging." International Conference on Learning Representations, 2024.](https://mlanthology.org/iclr/2024/huang2024iclr-tag2text/)
BibTeX
@inproceedings{huang2024iclr-tag2text,
  title = {{Tag2Text: Guiding Vision-Language Model via Image Tagging}},
  author = {Huang, Xinyu and Zhang, Youcai and Ma, Jinyu and Tian, Weiwei and Feng, Rui and Zhang, Yuejie and Li, Yaqian and Guo, Yandong and Zhang, Lei},
  booktitle = {International Conference on Learning Representations},
  year = {2024},
  url = {https://mlanthology.org/iclr/2024/huang2024iclr-tag2text/}
}