Cross-Modal Dense Passage Retrieval for Outside Knowledge Visual Question Answering
Abstract
In many language processing tasks, most notably large language modeling, retrieval augmentation improves model performance by supplying information at inference time that may not be encoded in the model's weights. This technique has proven particularly useful in multimodal settings. For some tasks, such as Outside Knowledge Visual Question Answering (OK-VQA), retrieval augmentation is required given the open-ended nature of the knowledge involved. In much prior work on the OK-VQA task, the retriever is either a unimodal language retriever or an untrained cross-modal retriever. In this work, we present a weakly supervised training approach for cross-modal retrievers. Our method draws inspiration from dense passage retrieval in natural language processing and extends those techniques to cross-modal retrieval. Since the OK-VQA task does not typically come with consistent ground-truth retrieval labels, we evaluate our model using the lexical overlap between the ground truth and the retrieved passage. Our approach improved recall by an average of 28% over a baseline backbone network across a wide range of retrieval sizes.
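
Since the abstract gives no implementation details, the Python sketch below illustrates one plausible form of the lexical-overlap recall metric it describes: a retrieved passage counts as a hit when it covers the tokens of a ground-truth answer. The function names, whitespace tokenization, and overlap threshold are illustrative assumptions, not the paper's exact procedure.

```python
# Hypothetical sketch of lexical-overlap recall@k, assuming a simple
# whitespace tokenizer and a coverage threshold; not the paper's exact setup.
from typing import List


def token_overlap(passage: str, answer: str) -> float:
    """Fraction of answer tokens that also appear in the passage."""
    passage_tokens = set(passage.lower().split())
    answer_tokens = answer.lower().split()
    if not answer_tokens:
        return 0.0
    hits = sum(1 for tok in answer_tokens if tok in passage_tokens)
    return hits / len(answer_tokens)


def recall_at_k(retrieved: List[str], answers: List[str], k: int,
                threshold: float = 1.0) -> float:
    """Return 1.0 if any of the top-k passages lexically covers some answer."""
    for passage in retrieved[:k]:
        if any(token_overlap(passage, ans) >= threshold for ans in answers):
            return 1.0
    return 0.0
```

Averaging `recall_at_k` over a question set, for several values of `k`, would yield the kind of recall-versus-retrieval-size comparison the abstract reports.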
Cite
Text
Reichman and Heck. "Cross-Modal Dense Passage Retrieval for Outside Knowledge Visual Question Answering." IEEE/CVF International Conference on Computer Vision Workshops, 2023. doi:10.1109/ICCVW60793.2023.00304

Markdown

[Reichman and Heck. "Cross-Modal Dense Passage Retrieval for Outside Knowledge Visual Question Answering." IEEE/CVF International Conference on Computer Vision Workshops, 2023.](https://mlanthology.org/iccvw/2023/reichman2023iccvw-crossmodal/) doi:10.1109/ICCVW60793.2023.00304

BibTeX
@inproceedings{reichman2023iccvw-crossmodal,
title = {{Cross-Modal Dense Passage Retrieval for Outside Knowledge Visual Question Answering}},
author = {Reichman, Benjamin Z. and Heck, Larry},
booktitle = {IEEE/CVF International Conference on Computer Vision Workshops},
year = {2023},
pages = {2829--2834},
doi = {10.1109/ICCVW60793.2023.00304},
url = {https://mlanthology.org/iccvw/2023/reichman2023iccvw-crossmodal/}
}