LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Sparse Retrieval

Abstract

Image-text retrieval (ITR) aims to retrieve images or texts that match a query originating from the other modality. The conventional dense retrieval paradigm relies on encoding images and texts into dense representations with dual-stream encoders. However, this approach is limited by slow retrieval speeds in large-scale scenarios. To address this issue, we propose a novel sparse retrieval paradigm for ITR that exploits sparse representations in the vocabulary space for images and texts. This paradigm enables us to leverage bag-of-words models and efficient inverted indexes, significantly reducing retrieval latency. A critical gap emerges from representing continuous image data in a sparse vocabulary space. To bridge this gap, we introduce a novel pre-training framework, Lexicon-Bottlenecked Language-Image Pre-Training (LexLIP), that learns importance-aware lexicon representations. By using lexicon-bottlenecked modules between the dual-stream encoders and weakened text decoders, we are able to construct continuous bag-of-words bottlenecks and learn lexicon-importance distributions. When pre-trained on data of the same scale, LexLIP achieves state-of-the-art performance on two ITR benchmarks, MSCOCO and Flickr30k. Furthermore, in large-scale retrieval scenarios, LexLIP outperforms CLIP with 5.8x faster retrieval speed and 19.1x less index storage memory. Beyond this, LexLIP surpasses CLIP on 8 out of 10 zero-shot image classification tasks.
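To make the sparse retrieval idea in the abstract concrete, here is a minimal sketch of scoring over an inverted index when images and a text query are both represented as sparse term-weight vectors in a shared vocabulary. It is an illustration under assumptions, not the LexLIP implementation: the function names (`build_inverted_index`, `search`), the image IDs, and the toy weights are all hypothetical, and in the actual system the weights would come from the learned lexicon-importance distributions.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each vocabulary term to a postings list of (doc_id, weight).

    `docs` is a dict of doc_id -> {term: weight}; zero weights are skipped,
    which is what keeps the index sparse.
    """
    index = defaultdict(list)
    for doc_id, weights in docs.items():
        for term, w in weights.items():
            if w > 0.0:
                index[term].append((doc_id, w))
    return index


def search(index, query, top_k=3):
    """Score documents by the dot product of sparse query and document vectors.

    Only postings lists for terms present in the query are visited, so
    documents sharing no term with the query are never touched.
    """
    scores = defaultdict(float)
    for term, q_weight in query.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    # Sort by descending score, breaking ties by doc_id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Hypothetical lexicon-weighted representations of three images.
images = {
    "img_dog": {"dog": 1.5, "grass": 0.5, "ball": 1.0},
    "img_cat": {"cat": 2.0, "sofa": 0.5},
    "img_park": {"grass": 1.0, "tree": 1.5, "dog": 0.5},
}

index = build_inverted_index(images)

# Hypothetical lexicon-weighted representation of a text query.
query = {"dog": 1.0, "ball": 2.0}
results = search(index, query)
print(results)
```

In this toy setup `img_cat` shares no term with the query and is never scored. Skipping non-matching documents in this way is the property the abstract appeals to when it attributes lower retrieval latency to inverted indexes.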

Cite

Text

Luo et al. "LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Sparse Retrieval." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.01029

Markdown

[Luo et al. "LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Sparse Retrieval." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/luo2023iccv-lexlip/) doi:10.1109/ICCV51070.2023.01029

BibTeX

@inproceedings{luo2023iccv-lexlip,
  title     = {{LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Sparse Retrieval}},
  author    = {Luo, Ziyang and Zhao, Pu and Xu, Can and Geng, Xiubo and Shen, Tao and Tao, Chongyang and Ma, Jing and Lin, Qingwei and Jiang, Daxin},
  booktitle = {International Conference on Computer Vision},
  year      = {2023},
  pages     = {11206--11217},
  doi       = {10.1109/ICCV51070.2023.01029},
  url       = {https://mlanthology.org/iccv/2023/luo2023iccv-lexlip/}
}