Learning Better Visual Dialog Agents with Pretrained Visual-Linguistic Representation

Abstract

GuessWhat?! is a visual dialog guessing game in which a Questioner agent generates a sequence of questions and an Oracle agent answers each question about a target object in an image. Based on this dialog history between the Questioner and the Oracle, a Guesser agent makes a final guess of the target object. While previous work has focused on dialogue policy optimization and visual-linguistic information fusion, most work learns the vision-linguistic encoding for the three agents solely on the GuessWhat?! dataset, without shared or prior knowledge of vision-linguistic representations. To bridge these gaps, this paper proposes new Oracle, Guesser and Questioner models that take advantage of a pretrained vision-linguistic model, ViLBERT. For the Oracle model, we introduce a two-way background/target fusion mechanism to understand both intra- and inter-object questions. For the Guesser model, we introduce a state-estimator that best utilizes ViLBERT's strength in single-turn referring expression comprehension. For the Questioner, we share the state-estimator from the pretrained Guesser with the Questioner to guide the question generator. Experimental results show that our proposed models significantly outperform state-of-the-art models, by 7%, 10% and 12% for the Oracle, Guesser and End-to-End Questioner respectively.

Cite

Text

Tu et al. "Learning Better Visual Dialog Agents with Pretrained Visual-Linguistic Representation." Conference on Computer Vision and Pattern Recognition, 2021. doi:10.1109/CVPR46437.2021.00557

Markdown

[Tu et al. "Learning Better Visual Dialog Agents with Pretrained Visual-Linguistic Representation." Conference on Computer Vision and Pattern Recognition, 2021.](https://mlanthology.org/cvpr/2021/tu2021cvpr-learning/) doi:10.1109/CVPR46437.2021.00557

BibTeX

@inproceedings{tu2021cvpr-learning,
  title     = {{Learning Better Visual Dialog Agents with Pretrained Visual-Linguistic Representation}},
  author    = {Tu, Tao and Ping, Qing and Thattai, Govindarajan and Tur, Gokhan and Natarajan, Prem},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2021},
  pages     = {5622-5631},
  doi       = {10.1109/CVPR46437.2021.00557},
  url       = {https://mlanthology.org/cvpr/2021/tu2021cvpr-learning/}
}