Knowledge-Based Visual Question Answer with Multimodal Processing, Retrieval and Filtering

Abstract

The task of Knowledge-Based Visual Question Answering (KB-VQA) requires a model to understand visual features and retrieve external knowledge. Retrieval-Augmented Generation (RAG) has been employed to address this problem through knowledge base querying. However, existing work exhibits two limitations: insufficient interactivity during knowledge retrieval and ineffective organization of retrieved information for the Vision-Language Model (VLM). To address these challenges, we propose a three-stage vision-language model framework with Process, Retrieve and Filter (VLM-PRF). For interactive retrieval, VLM-PRF uses reinforcement learning (RL) to guide the model to strategically process information via tool-driven operations. For knowledge filtering, our method trains the VLM to transform raw retrieved information into task-specific knowledge. Using a dual reward as the supervisory signal, VLM-PRF enables the model to optimize its retrieval strategy and answer-generation capability simultaneously. Experiments on two datasets demonstrate the effectiveness of our framework.
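In spirit, a "dual reward" of the kind the abstract describes combines a signal for answer correctness with a signal for retrieval quality. The sketch below is a minimal toy illustration of that idea only; the function name, the binary reward terms, and the weights are assumptions for exposition, not the paper's actual reward specification.

```python
def dual_reward(answer_correct: bool, retrieval_useful: bool,
                w_answer: float = 0.7, w_retrieval: float = 0.3) -> float:
    """Toy combination of an answer-accuracy signal and a
    retrieval-quality signal into one scalar RL reward.
    Weights are illustrative assumptions, not from the paper."""
    r_answer = 1.0 if answer_correct else 0.0
    r_retrieval = 1.0 if retrieval_useful else 0.0
    return w_answer * r_answer + w_retrieval * r_retrieval
```

A scalar of this shape lets a single policy-gradient update credit both the retrieval behavior and the final answer, which is how one signal can supervise both capabilities at once.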

Cite

Text

Hong et al. "Knowledge-Based Visual Question Answer with Multimodal Processing, Retrieval and Filtering." Advances in Neural Information Processing Systems, 2025.

Markdown

[Hong et al. "Knowledge-Based Visual Question Answer with Multimodal Processing, Retrieval and Filtering." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/hong2025neurips-knowledgebased/)

BibTeX

@inproceedings{hong2025neurips-knowledgebased,
  title     = {{Knowledge-Based Visual Question Answer with Multimodal Processing, Retrieval and Filtering}},
  author    = {Hong, Yuyang and Gu, Jiaqi and Yang, Qi and Fan, Lubin and Wu, Yue and Wang, Ying and Ding, Kun and Xiang, Shiming and Ye, Jieping},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/hong2025neurips-knowledgebased/}
}