Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning

Abstract

We introduce the first goal-driven training for visual question answering and dialog agents. Specifically, we pose a cooperative 'image guessing' game between two agents -- Qbot and Abot -- who communicate in natural language dialog so that Qbot can select an unseen image from a lineup of images. We use deep reinforcement learning (RL) to learn the policies of these agents end-to-end -- from pixels to multi-agent multi-round dialog to game reward. We demonstrate two experimental results. First, as a 'sanity check' demonstration of pure RL (from scratch), we show results on a synthetic world, where the agents communicate in ungrounded vocabulary, i.e., symbols with no pre-specified meanings (X, Y, Z). We find that the two bots invent their own communication protocol and start using certain symbols to ask/answer about certain visual attributes (shape/color/size). Thus, we demonstrate the emergence of grounded language and communication among 'visual' dialog agents with no human supervision at all. Second, we conduct large-scale real-image experiments on the VisDial dataset, where we pretrain on dialog data and show that the RL fine-tuned agents significantly outperform supervised pretraining. Interestingly, the RL Qbot learns to ask questions that Abot is good at, ultimately resulting in more informative dialog and a better team.
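The synthetic-world setting described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: it assumes a single dialog round, tabular softmax policies in place of neural networks, a plain REINFORCE update with a running-mean baseline, four values per attribute, and a four-symbol Abot vocabulary. All names (`TabularPolicy`, `train`, `A_VOCAB`, and so on) are hypothetical. Abot sees an 'image' (a shape/color/size triple), Qbot is told which attribute to recover, asks with one of the ungrounded symbols X, Y, Z, and both agents share a reward of 1 when Qbot's guess is correct.

```python
import math
import random

ATTRS = ("shape", "color", "size")  # attribute types named in the abstract
N_VALUES = 4                         # values per attribute (assumed)
Q_VOCAB = ("X", "Y", "Z")            # Qbot symbols with no pre-specified meaning
A_VOCAB = ("1", "2", "3", "4")       # Abot symbols (assumed vocabulary)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class TabularPolicy:
    """Softmax policy holding one row of logits per discrete state."""

    def __init__(self, n_states, n_actions):
        self.logits = [[0.0] * n_actions for _ in range(n_states)]

    def act(self, state, rng):
        probs = softmax(self.logits[state])
        action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        return action, probs

    def reinforce(self, state, action, probs, advantage, lr):
        # d log pi(action | state) / d logits = onehot(action) - probs
        row = self.logits[state]
        for a in range(len(row)):
            row[a] += lr * advantage * ((1.0 if a == action else 0.0) - probs[a])


def train(episodes, seed=0, lr=0.5):
    """Run one-round games; return the three policies and per-episode rewards."""
    rng = random.Random(seed)
    n_images = N_VALUES ** len(ATTRS)
    q_ask = TabularPolicy(len(ATTRS), len(Q_VOCAB))               # task -> question
    a_ans = TabularPolicy(len(Q_VOCAB) * n_images, len(A_VOCAB))  # (question, image) -> answer
    q_guess = TabularPolicy(len(ATTRS) * len(A_VOCAB), N_VALUES)  # (task, answer) -> guess
    baseline, rewards = 0.0, []
    for _ in range(episodes):
        image = [rng.randrange(N_VALUES) for _ in ATTRS]  # visible to Abot only
        task = rng.randrange(len(ATTRS))                   # attribute Qbot must recover
        image_id = sum(v * N_VALUES ** i for i, v in enumerate(image))

        question, p_q = q_ask.act(task, rng)
        a_state = question * n_images + image_id
        answer, p_a = a_ans.act(a_state, rng)
        g_state = task * len(A_VOCAB) + answer
        guess, p_g = q_guess.act(g_state, rng)

        reward = 1.0 if guess == image[task] else 0.0      # shared team reward
        advantage = reward - baseline
        q_ask.reinforce(task, question, p_q, advantage, lr)
        a_ans.reinforce(a_state, answer, p_a, advantage, lr)
        q_guess.reinforce(g_state, guess, p_g, advantage, lr)
        baseline += 0.01 * (reward - baseline)             # running-mean baseline
        rewards.append(reward)
    return (q_ask, a_ans, q_guess), rewards


if __name__ == "__main__":
    _, rewards = train(20000)
    print("mean reward, last 1000 games:", sum(rewards[-1000:]) / 1000)
```

Because no symbol has a fixed meaning, any protocol that appears comes only from the shared reward. Whether and how quickly one emerges in this sketch depends on the seed, the learning rate, and the number of games, so no particular reward level should be read as a reproduction of the paper's results.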

Cite

Text

Das et al. "Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning." International Conference on Computer Vision, 2017. doi:10.1109/ICCV.2017.321

Markdown

[Das et al. "Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning." International Conference on Computer Vision, 2017.](https://mlanthology.org/iccv/2017/das2017iccv-learning/) doi:10.1109/ICCV.2017.321

BibTeX

@inproceedings{das2017iccv-learning,
  title     = {{Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning}},
  author    = {Das, Abhishek and Kottur, Satwik and Moura, Jose M. F. and Lee, Stefan and Batra, Dhruv},
  booktitle = {International Conference on Computer Vision},
  year      = {2017},
  doi       = {10.1109/ICCV.2017.321},
  url       = {https://mlanthology.org/iccv/2017/das2017iccv-learning/}
}