Prompt-RSVQA: Prompting Visual Context to a Language Model for Remote Sensing Visual Question Answering

Abstract

Remote sensing visual question answering (RSVQA) was recently proposed with the aim of interfacing natural language and vision, easing access to the information contained in Earth Observation data for a wide audience through simple questions in natural language. The traditional vision/language interface is an embedding obtained by fusing features from two deep models, one processing the image and the other the question. Despite the success of early VQA models, it remains difficult to control the adequacy of the visual information extracted by the image model, which should act as a context regularizing the work of the language model. We propose to extract this context information with a visual model, convert it to text, and inject it, i.e. prompt it, into a language model. The language model is therefore responsible for processing the question together with the visual context and for extracting the features that are useful for finding the answer. We study the effect of prompting with respect to a black-box visual extractor and discuss the importance of training a visual model that produces accurate context.
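
The prompting step described in the abstract (visual context converted to text, then passed to the language model alongside the question) can be illustrated with a minimal sketch. This is not the authors' implementation: the class names, the 0.5 score threshold, the `[SEP]` separator, and both function names are assumptions made for illustration only, and the visual model is replaced by a hard-coded dictionary of scores.

```python
from typing import Dict


def visual_context_to_text(class_scores: Dict[str, float], threshold: float = 0.5) -> str:
    """Convert per-class scores from a (hypothetical) visual model into text.

    Classes scoring at or above the threshold are kept and listed
    alphabetically; the threshold value is an assumption.
    """
    kept = sorted(name for name, score in class_scores.items() if score >= threshold)
    return ", ".join(kept) if kept else "none"


def build_prompt(question: str, context: str, sep: str = " [SEP] ") -> str:
    """Join the question and the textual visual context into one input string.

    The separator is an illustrative choice, not one taken from the paper.
    """
    return f"{question}{sep}{context}"


# Stand-in for the output of a visual model (made-up scores).
scores = {"forest": 0.91, "water": 0.12, "urban fabric": 0.67}
context = visual_context_to_text(scores)
prompt = build_prompt("Is there a forest in the image?", context)
print(prompt)
```

In a full pipeline, the resulting string would be tokenized and fed to a language model whose output features are used to predict the answer. The sketch only shows how inaccurate visual predictions would directly change the text the language model sees, which is why the accuracy of the visual context matters.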

Cite

Text

Chappuis et al. "Prompt-RSVQA: Prompting Visual Context to a Language Model for Remote Sensing Visual Question Answering." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2022. doi:10.1109/CVPRW56347.2022.00143

Markdown

[Chappuis et al. "Prompt-RSVQA: Prompting Visual Context to a Language Model for Remote Sensing Visual Question Answering." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2022.](https://mlanthology.org/cvprw/2022/chappuis2022cvprw-promptrsvqa/) doi:10.1109/CVPRW56347.2022.00143

BibTeX

@inproceedings{chappuis2022cvprw-promptrsvqa,
  title     = {{Prompt-RSVQA: Prompting Visual Context to a Language Model for Remote Sensing Visual Question Answering}},
  author    = {Chappuis, Christel and Zermatten, Valérie and Lobry, Sylvain and Le Saux, Bertrand and Tuia, Devis},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2022},
  pages     = {1371-1380},
  doi       = {10.1109/CVPRW56347.2022.00143},
  url       = {https://mlanthology.org/cvprw/2022/chappuis2022cvprw-promptrsvqa/}
}