RAPPER: Reinforced Rationale-Prompted Paradigm for Natural Language Explanation in Visual Question Answering
Abstract
Natural Language Explanation (NLE) in vision-and-language tasks aims to provide human-understandable explanations for the associated decision-making process. In practice, generated explanations may lack informativeness or contradict visually grounded facts, known as the implausibility and hallucination problems, respectively. To tackle these issues, we consider the task of visual question answering (VQA) and introduce Rapper, a two-stage Reinforced Rationale-Prompted Paradigm. In the first stage, Rapper infuses rationale prompting through knowledge distillation from large language models (LLMs), encouraging rationales supported by language-based facts. In the second stage, a unique Reinforcement Learning from NLE Feedback (RLNF) is introduced to inject visual facts into NLE generation. Quantitative and qualitative experiments on two VL-NLE benchmarks show that Rapper surpasses state-of-the-art VQA-NLE methods while providing plausible and faithful NLE.
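The abstract outlines a two-stage pipeline: supervised distillation of LLM rationales, followed by RL fine-tuning with an NLE-based reward. Below is a minimal illustrative sketch of that control flow. All names (`llm_rationale`, `RationaleGenerator`, `nle_reward`, `train_rapper`) are hypothetical stand-ins, not the authors' implementation, and the model/reward internals are stubbed out.

```python
# Hypothetical sketch of the two-stage Rapper paradigm described above.
# Model updates and the reward are stubbed; real training would use a
# vision-language model and a learned/rule-based NLE-feedback reward.

import random


def llm_rationale(question, caption):
    """Stage-1 teacher: an LLM prompted to produce a rationale grounded
    in language-based facts. Stubbed here; in practice this would call
    a large language model."""
    return f"Rationale for '{question}' given '{caption}'"


class RationaleGenerator:
    """Student model: distilled from LLM rationales (stage 1), then
    fine-tuned with RL from NLE feedback (stage 2)."""

    def generate(self, image, question):
        # Placeholder decode; a real model conditions on visual features.
        return "a generated rationale"

    def distill_step(self, question, caption, teacher_rationale):
        # Stage 1: supervised (e.g., cross-entropy) update toward the
        # LLM-provided rationale.
        pass

    def rl_step(self, image, question, rationale, reward):
        # Stage 2: policy-gradient update weighted by the reward.
        pass


def nle_reward(image, question, answer, rationale):
    """Stage-2 reward: scores whether the rationale-prompted NLE is
    plausible and faithful to the visual facts. Stubbed as random."""
    return random.random()


def train_rapper(dataset, epochs_distill=1, epochs_rl=1):
    model = RationaleGenerator()
    # Stage 1: knowledge distillation of rationale prompting from an LLM.
    for _ in range(epochs_distill):
        for image, caption, question, answer in dataset:
            teacher = llm_rationale(question, caption)
            model.distill_step(question, caption, teacher)
    # Stage 2: Reinforcement Learning from NLE Feedback (RLNF).
    for _ in range(epochs_rl):
        for image, caption, question, answer in dataset:
            rationale = model.generate(image, question)
            reward = nle_reward(image, question, answer, rationale)
            model.rl_step(image, question, rationale, reward)
    return model
```

The key design point the sketch conveys is the ordering: language-fact grounding is established first via distillation, and only then is the generator exposed to visual-fact feedback through RL, so the reward refines rather than replaces the rationale prior.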
Cite
Text
Chang et al. "RAPPER: Reinforced Rationale-Prompted Paradigm for Natural Language Explanation in Visual Question Answering." International Conference on Learning Representations, 2024.
Markdown
[Chang et al. "RAPPER: Reinforced Rationale-Prompted Paradigm for Natural Language Explanation in Visual Question Answering." International Conference on Learning Representations, 2024.](https://mlanthology.org/iclr/2024/chang2024iclr-rapper/)
BibTeX
@inproceedings{chang2024iclr-rapper,
  title     = {{RAPPER: Reinforced Rationale-Prompted Paradigm for Natural Language Explanation in Visual Question Answering}},
  author    = {Chang, Kai-Po and Huang, Chi-Pin and Cheng, Wei-Yuan and Yang, Fu-En and Wang, Chien-Yi and Lai, Yung-Hsuan and Wang, Yu-Chiang Frank},
  booktitle = {International Conference on Learning Representations},
  year      = {2024},
  url       = {https://mlanthology.org/iclr/2024/chang2024iclr-rapper/}
}