GLIPv2: Unifying Localization and Vision-Language Understanding

Abstract

We present GLIPv2, a grounded VL understanding model that serves both localization tasks (e.g., object detection, instance segmentation) and Vision-Language (VL) understanding tasks (e.g., VQA, image captioning). GLIPv2 elegantly unifies localization pre-training and Vision-Language Pre-training (VLP) with three pre-training tasks: phrase grounding as a VL reformulation of the detection task, region-word contrastive learning as a novel region-word-level contrastive learning task, and masked language modeling. This unification not only simplifies the previous multi-stage VLP procedure but also achieves mutual benefits between localization and understanding tasks. Experimental results show that a single GLIPv2 model (all model weights are shared) achieves near-SoTA performance on various localization and understanding tasks. The model also shows (1) strong zero-shot and few-shot adaptation performance on open-vocabulary object detection tasks and (2) superior grounding capability on VL understanding tasks.
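To make the three pre-training tasks named in the abstract concrete, the sketch below shows one plausible shape they could take in PyTorch: phrase grounding as region-word alignment scoring, an inter-image region-word contrastive term, and standard masked language modeling. The dot-product alignment head, all tensor shapes, and all function and argument names are illustrative assumptions, not the released GLIPv2 implementation (which also includes box-regression and focal-loss details omitted here).

# A minimal sketch of the three pre-training losses, under the assumptions
# stated above; names and shapes are hypothetical, not GLIPv2's actual code.
import torch
import torch.nn.functional as F


def grounding_loss(region_feats, word_feats, target):
    """Phrase grounding as a VL reformulation of detection: score every
    (region, word) pair and supervise with the ground-truth alignment.

    region_feats: (R, D) region features from the visual encoder
    word_feats:   (W, D) token features from the text encoder
    target:       (R, W) binary matrix, 1 where a region matches a word
    """
    logits = region_feats @ word_feats.t()  # (R, W) alignment scores
    return F.binary_cross_entropy_with_logits(logits, target.float())


def region_word_contrastive_loss(region_feats_list, word_feats_list,
                                 target_list, temperature=0.07):
    """Region-word contrastive learning: regions of one image treat the
    words of every other image in the batch as negatives, and vice versa."""
    regions = torch.cat(region_feats_list, dim=0)     # (sum_R, D)
    words = torch.cat(word_feats_list, dim=0)         # (sum_W, D)
    targets = torch.block_diag(*target_list).float()  # cross-image pairs are negatives
    logits = regions @ words.t() / temperature
    # Symmetric InfoNCE-style objective with normalized (soft) targets.
    r2w = F.cross_entropy(logits, targets / targets.sum(-1, keepdim=True).clamp(min=1))
    w2r = F.cross_entropy(logits.t(), targets.t() / targets.t().sum(-1, keepdim=True).clamp(min=1))
    return 0.5 * (r2w + w2r)


def mlm_loss(token_logits, masked_labels):
    """Masked language modeling: cross-entropy on masked token positions
    only (unmasked positions carry the ignore index -100)."""
    return F.cross_entropy(token_logits.view(-1, token_logits.size(-1)),
                           masked_labels.view(-1), ignore_index=-100)

In a unified pre-training setup of this kind, the overall objective would simply be the sum of these terms (plus the localization-specific box-regression losses not shown here), so one set of shared weights is trained for both localization and VL understanding.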

Cite

Text

Zhang et al. "GLIPv2: Unifying Localization and Vision-Language Understanding." Neural Information Processing Systems, 2022.

Markdown

[Zhang et al. "GLIPv2: Unifying Localization and Vision-Language Understanding." Neural Information Processing Systems, 2022.](https://mlanthology.org/neurips/2022/zhang2022neurips-glipv2/)

BibTeX

@inproceedings{zhang2022neurips-glipv2,
  title     = {{GLIPv2: Unifying Localization and Vision-Language Understanding}},
  author    = {Zhang, Haotian and Zhang, Pengchuan and Hu, Xiaowei and Chen, Yen-Chun and Li, Liunian and Dai, Xiyang and Wang, Lijuan and Yuan, Lu and Hwang, Jenq-Neng and Gao, Jianfeng},
  booktitle = {Neural Information Processing Systems},
  year      = {2022},
  url       = {https://mlanthology.org/neurips/2022/zhang2022neurips-glipv2/}
}