Knowledge-Based Visual Question Answer with Multimodal Processing, Retrieval and Filtering

Abstract

The task of Knowledge-Based Visual Question Answering (KB-VQA) requires a model to understand visual features and retrieve external knowledge. Retrieval-Augmented Generation (RAG) has been employed to address this problem through knowledge base querying. However, existing work exhibits two limitations: insufficient interactivity during knowledge retrieval and ineffective organization of retrieved information for the Vision-Language Model (VLM). To address these challenges, we propose a three-stage vision-language model framework with Process, Retrieve and Filter (VLM-PRF). For interactive retrieval, VLM-PRF uses reinforcement learning (RL) to guide the model to strategically process information via tool-driven operations. For knowledge filtering, our method trains the VLM to transform raw retrieved information into task-specific knowledge. Using a dual reward as the supervisory signal, VLM-PRF enables the model to optimize its retrieval strategy and answer-generation capability simultaneously. Experiments on two datasets demonstrate the effectiveness of our framework.
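In spirit, a "dual reward" of the kind the abstract describes combines a signal for answer correctness with a signal for retrieval quality. The sketch below is a minimal toy illustration of that idea only; the function name, the binary reward terms, and the weights are assumptions for exposition, not the paper's actual reward specification.

```python
def dual_reward(answer_correct: bool, retrieval_useful: bool,
                w_answer: float = 0.7, w_retrieval: float = 0.3) -> float:
    """Toy combination of an answer-accuracy signal and a
    retrieval-quality signal into one scalar RL reward.
    Weights are illustrative assumptions, not from the paper."""
    r_answer = 1.0 if answer_correct else 0.0
    r_retrieval = 1.0 if retrieval_useful else 0.0
    return w_answer * r_answer + w_retrieval * r_retrieval
```

A scalar of this shape lets a single policy-gradient update credit both the retrieval behavior and the final answer, which is how one signal can supervise both capabilities at once.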

Cite

Text

Hong et al. "Knowledge-Based Visual Question Answer with Multimodal Processing, Retrieval and Filtering." Advances in Neural Information Processing Systems, 2025.

Markdown

[Hong et al. "Knowledge-Based Visual Question Answer with Multimodal Processing, Retrieval and Filtering." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/hong2025neurips-knowledgebased/)

BibTeX

@inproceedings{hong2025neurips-knowledgebased,
  title     = {{Knowledge-Based Visual Question Answer with Multimodal Processing, Retrieval and Filtering}},
  author    = {Hong, Yuyang and Gu, Jiaqi and Yang, Qi and Fan, Lubin and Wu, Yue and Wang, Ying and Ding, Kun and Xiang, Shiming and Ye, Jieping},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/hong2025neurips-knowledgebased/}
}