Adding Object Detection Skills to Visual Dialogue Agents

Abstract

Our goal is to equip a dialogue agent that asks questions about a visual scene with object detection skills. We take the first steps in this direction within the GuessWhat?! game. We use Mask R-CNN object features as a replacement for ground-truth annotations in the Guesser module, achieving an accuracy of 57.92%. This shows that our system is a viable alternative to the original Guesser, which achieves an accuracy of 62.77% using ground-truth annotations and should therefore be considered an upper bound for our automated system. Crucially, we show that our system exploits the Mask R-CNN object features, in contrast to the original Guesser augmented with global VGG features. Furthermore, by automating object detection in GuessWhat?!, we open up a spectrum of opportunities, such as playing the game with new, non-annotated images and using the more granular visual features to condition the other modules of the game architecture.

Cite

Text

Bani et al. "Adding Object Detection Skills to Visual Dialogue Agents." European Conference on Computer Vision Workshops, 2018. doi:10.1007/978-3-030-11018-5_17

Markdown

[Bani et al. "Adding Object Detection Skills to Visual Dialogue Agents." European Conference on Computer Vision Workshops, 2018.](https://mlanthology.org/eccvw/2018/bani2018eccvw-adding/) doi:10.1007/978-3-030-11018-5_17

BibTeX

@inproceedings{bani2018eccvw-adding,
  title     = {{Adding Object Detection Skills to Visual Dialogue Agents}},
  author    = {Bani, Gabriele and Belli, Davide and Dagan, Gautier and Geenen, Alexander and Skliar, Andrii and Venkatesh, Aashish and Baumgärtner, Tim and Bruni, Elia and Fernández, Raquel},
  booktitle = {European Conference on Computer Vision Workshops},
  year      = {2018},
  pages     = {180--187},
  doi       = {10.1007/978-3-030-11018-5_17},
  url       = {https://mlanthology.org/eccvw/2018/bani2018eccvw-adding/}
}