Fine-Grained Late-Interaction Multi-Modal Retrieval for Retrieval Augmented Visual Question Answering

Abstract

Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to utilize knowledge from external knowledge bases to answer visually-grounded questions. Retrieval-Augmented Visual Question Answering (RA-VQA), a strong framework to tackle KB-VQA, first retrieves related documents with Dense Passage Retrieval (DPR) and then uses them to answer questions. This paper proposes Fine-grained Late-interaction Multi-modal Retrieval (FLMR), which significantly improves knowledge retrieval in RA-VQA. FLMR addresses two major limitations in RA-VQA's retriever: (1) the image representations obtained via image-to-text transforms can be incomplete and inaccurate, and (2) similarity scores between queries and documents are computed with one-dimensional embeddings, which can be insensitive to finer-grained similarities. FLMR overcomes these limitations by obtaining image representations that complement those from the image-to-text transform, using a vision model aligned with an existing text-based retriever through a simple alignment network. FLMR also encodes images and questions using multi-dimensional embeddings to capture finer-grained similarities between queries and documents. FLMR significantly improves the original RA-VQA retriever's PRRecall@5 by approximately 8%. Finally, we equipped RA-VQA with two state-of-the-art large multi-modal/language models to achieve a VQA score of approximately 62% on the OK-VQA dataset.
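
The contrast between one-dimensional and multi-dimensional embeddings can be illustrated with a small sketch. It assumes a ColBERT-style late-interaction score (for each query token, take the maximum dot product over document tokens, then sum), which the abstract does not spell out. The function names, the mean-pooling step, and the toy vectors are all hypothetical and are not taken from the paper or its code.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_pool(tokens: List[Vector]) -> Vector:
    """Collapse token-level embeddings into one vector (an assumed stand-in
    for a single-vector, DPR-style encoder output)."""
    dim = len(tokens[0])
    return [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]


def single_vector_score(query: Vector, doc: Vector) -> float:
    """DPR-style relevance: one dot product between two pooled vectors."""
    return dot(query, doc)


def late_interaction_score(query_tokens: List[Vector],
                           doc_tokens: List[Vector]) -> float:
    """Assumed late-interaction relevance: each query token is matched to its
    best document token, and the per-token maxima are summed."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy 2-dimensional token embeddings (made up for illustration).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]   # matches each query token exactly
doc_b = [[0.5, 0.5], [0.5, 0.5]]   # only a blurred average of both tokens

pooled_a = single_vector_score(mean_pool(query), mean_pool(doc_a))
pooled_b = single_vector_score(mean_pool(query), mean_pool(doc_b))
late_a = late_interaction_score(query, doc_a)
late_b = late_interaction_score(query, doc_b)

print(pooled_a, pooled_b)  # identical pooled scores
print(late_a, late_b)      # token-level scores differ
```

In this toy case both documents receive the same pooled score, while the token-level score separates the document that matches each query token from the one that only matches on average. This is the kind of finer-grained similarity the abstract refers to.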

Cite

Text

Lin et al. "Fine-Grained Late-Interaction Multi-Modal Retrieval for Retrieval Augmented Visual Question Answering." Neural Information Processing Systems, 2023.

Markdown

[Lin et al. "Fine-Grained Late-Interaction Multi-Modal Retrieval for Retrieval Augmented Visual Question Answering." Neural Information Processing Systems, 2023.](https://mlanthology.org/neurips/2023/lin2023neurips-finegrained/)

BibTeX

@inproceedings{lin2023neurips-finegrained,
  title     = {{Fine-Grained Late-Interaction Multi-Modal Retrieval for Retrieval Augmented Visual Question Answering}},
  author    = {Lin, Weizhe and Chen, Jinghong and Mei, Jingbiao and Coca, Alexandru and Byrne, Bill},
  booktitle = {Neural Information Processing Systems},
  year      = {2023},
  url       = {https://mlanthology.org/neurips/2023/lin2023neurips-finegrained/}
}