Visual Question Answering with Textual Representations for Images

Abstract

How far can we go with textual representations for understanding pictures? Deep visual features extracted by object recognition models are prevalently used in multiple tasks, especially in visual question answering (VQA). However, conventional deep visual features may struggle to convey all the details in an image as humans do. Meanwhile, with recent progress in language models, descriptive text may offer an alternative. This paper delves into the effectiveness of textual representations for image understanding in the specific context of VQA.
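The core idea can be illustrated with a minimal sketch: convert the image into a textual description, then answer the question with a text-only model over that description instead of deep visual features. This is only a rough sketch of the general approach, not the authors' pipeline; the Hugging Face model names below are illustrative assumptions, not the models evaluated in the paper.

# Minimal sketch of text-based VQA (illustrative only, not the paper's method).
# Assumes the `transformers` library; model names are placeholder choices.
from transformers import pipeline

# Step 1: a captioning model produces the textual representation of the image.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
# Step 2: a text-only QA model answers the question against that description.
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

def vqa_with_text(image_path: str, question: str) -> str:
    # Describe the image in words rather than extracting visual features.
    caption = captioner(image_path)[0]["generated_text"]
    # Answer the question using the caption as the only context.
    return qa(question=question, context=caption)["answer"]

print(vqa_with_text("example.jpg", "What is the dog playing with?"))

In this setup, any detail the caption omits is invisible to the QA model, which is exactly the trade-off between descriptive text and deep visual features that the paper examines.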

Cite

Text

Hirota et al. "Visual Question Answering with Textual Representations for Images." IEEE/CVF International Conference on Computer Vision Workshops, 2021. doi:10.1109/ICCVW54120.2021.00353

Markdown

[Hirota et al. "Visual Question Answering with Textual Representations for Images." IEEE/CVF International Conference on Computer Vision Workshops, 2021.](https://mlanthology.org/iccvw/2021/hirota2021iccvw-visual/) doi:10.1109/ICCVW54120.2021.00353

BibTeX

@inproceedings{hirota2021iccvw-visual,
  title     = {{Visual Question Answering with Textual Representations for Images}},
  author    = {Hirota, Yusuke and Garcia, Noa and Otani, Mayu and Chu, Chenhui and Nakashima, Yuta and Taniguchi, Ittetsu and Onoye, Takao},
  booktitle = {IEEE/CVF International Conference on Computer Vision Workshops},
  year      = {2021},
  pages     = {3147--3150},
  doi       = {10.1109/ICCVW54120.2021.00353},
  url       = {https://mlanthology.org/iccvw/2021/hirota2021iccvw-visual/}
}