Universal Vision-Language Dense Retrieval: Learning a Unified Representation Space for Multi-Modal Retrieval

Abstract

This paper presents Universal Vision-Language Dense Retrieval (UniVL-DR), which builds a unified model for multi-modal retrieval. UniVL-DR encodes queries and multi-modality resources into a single embedding space and searches for candidates from different modalities within it. To learn this unified embedding space, UniVL-DR introduces two techniques: 1) a universal embedding optimization strategy, which contrastively optimizes the embedding space using modality-balanced hard negatives; and 2) an image verbalization method, which bridges the modality gap between images and texts in the raw data space. UniVL-DR achieves state-of-the-art performance on the multi-modal open-domain question answering benchmark WebQA and outperforms all retrieval models on both subtasks, text-text retrieval and text-image retrieval. These results demonstrate that a unified model for universal multi-modal search can replace the divide-and-conquer pipeline while also benefiting single- and cross-modality retrieval tasks. All source code for this work is available at https://github.com/OpenMatch/UniVL-DR.
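
The sketch below is a minimal illustration (not the authors' implementation) of contrastive training with modality-balanced hard negatives: each query's candidate pool mixes the positive with an equal number of hard negative texts and hard negative images, so neither modality dominates the learned embedding space. The PyTorch usage, function names (contrastive_loss, build_candidate_pool), temperature value, and the random tensors standing in for encoder outputs are all illustrative assumptions.

import torch
import torch.nn.functional as F

def contrastive_loss(q_emb, cand_emb, pos_idx, temperature=0.05):
    """InfoNCE-style loss over a candidate pool shared across the batch.

    q_emb:    (B, D) query embeddings
    cand_emb: (C, D) candidate embeddings (positives + text/image hard negatives)
    pos_idx:  (B,)   index of each query's positive candidate in cand_emb
    """
    q_emb = F.normalize(q_emb, dim=-1)
    cand_emb = F.normalize(cand_emb, dim=-1)
    scores = q_emb @ cand_emb.t() / temperature   # (B, C) similarity matrix
    return F.cross_entropy(scores, pos_idx)

def build_candidate_pool(pos, text_negs, img_negs, k=2):
    """Modality-balanced pool: keep k hard negatives per modality per query."""
    dim = pos.size(-1)
    return torch.cat([pos,
                      text_negs[:, :k].reshape(-1, dim),
                      img_negs[:, :k].reshape(-1, dim)], dim=0)

# Toy usage with random embeddings standing in for encoder outputs.
B, D, k = 4, 16, 2
queries   = torch.randn(B, D)
positives = torch.randn(B, D)
text_negs = torch.randn(B, k, D)   # hard negatives mined from the text corpus
img_negs  = torch.randn(B, k, D)   # hard negatives mined from the image corpus

pool = build_candidate_pool(positives, text_negs, img_negs, k)
# Positives occupy the first B rows of the pool, so their indices are 0..B-1.
loss = contrastive_loss(queries, pool, pos_idx=torch.arange(B))
print(loss.item())

In practice the query, text, and image embeddings would come from the retriever's encoders, and the hard negatives would be mined from the mixed-modality corpus; the balanced sampling per modality is the point being illustrated here.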

Cite

Text

Liu et al. "Universal Vision-Language Dense Retrieval: Learning a Unified Representation Space for Multi-Modal Retrieval." International Conference on Learning Representations, 2023.

Markdown

[Liu et al. "Universal Vision-Language Dense Retrieval: Learning a Unified Representation Space for Multi-Modal Retrieval." International Conference on Learning Representations, 2023.](https://mlanthology.org/iclr/2023/liu2023iclr-universal/)

BibTeX

@inproceedings{liu2023iclr-universal,
  title     = {{Universal Vision-Language Dense Retrieval: Learning a Unified Representation Space for Multi-Modal Retrieval}},
  author    = {Liu, Zhenghao and Xiong, Chenyan and Lv, Yuanhuiyi and Liu, Zhiyuan and Yu, Ge},
  booktitle = {International Conference on Learning Representations},
  year      = {2023},
  url       = {https://mlanthology.org/iclr/2023/liu2023iclr-universal/}
}