Towards Models That Can See and Read

Abstract

Visual Question Answering (VQA) and Image Captioning (CAP), which are among the most popular vision-language tasks, have analogous scene-text versions that require reasoning from the text in the image. Despite their obvious resemblance, the two are treated independently and, as we show, yield task-specific methods that can either see or read, but not both. In this work, we conduct an in-depth analysis of this phenomenon and propose UniTNT, a Unified Text-Non-Text approach, which grants existing multimodal architectures scene-text understanding capabilities. Specifically, we treat scene-text information as an additional modality, fusing it with any pretrained encoder-decoder-based architecture via designated modules. Thorough experiments reveal that UniTNT leads to the first single model that successfully handles both task types. Moreover, we show that scene-text understanding capabilities can boost vision-language models' performance on general VQA and CAP by up to 2.69% and 0.6 CIDEr, respectively.
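
The fusion idea summarized above, treating scene-text as an additional modality injected into a pretrained encoder-decoder through designated modules, can be sketched roughly as follows. This is a minimal, hypothetical PyTorch illustration: the module name SceneTextFusion, the cross-attention fusion scheme, and all dimensions are assumptions made for exposition, not the authors' implementation.

# Hypothetical sketch: fuse scene-text (OCR) features as an extra modality
# into the visual stream of a pretrained encoder-decoder model.
# Names and design choices are illustrative, not the UniTNT code.
import torch
import torch.nn as nn

class SceneTextFusion(nn.Module):
    """Projects OCR token embeddings to the model width and fuses them with
    visual features via cross-attention, so the decoder sees both modalities."""

    def __init__(self, ocr_dim: int, model_dim: int, num_heads: int = 8):
        super().__init__()
        self.ocr_proj = nn.Linear(ocr_dim, model_dim)   # map OCR features to model width
        self.cross_attn = nn.MultiheadAttention(model_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(model_dim)

    def forward(self, visual_feats: torch.Tensor, ocr_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (B, N_vis, D) from a frozen pretrained image encoder
        # ocr_feats:    (B, N_ocr, D_ocr) from an off-the-shelf scene-text recognizer
        ocr_tokens = self.ocr_proj(ocr_feats)
        # Visual tokens attend to scene-text tokens (queries = visual, keys/values = OCR).
        fused, _ = self.cross_attn(visual_feats, ocr_tokens, ocr_tokens)
        return self.norm(visual_feats + fused)          # residual fusion

if __name__ == "__main__":
    fusion = SceneTextFusion(ocr_dim=300, model_dim=768)
    visual = torch.randn(2, 196, 768)   # e.g., ViT patch tokens
    ocr = torch.randn(2, 20, 300)       # e.g., recognized-word embeddings
    print(fusion(visual, ocr).shape)    # torch.Size([2, 196, 768])

A complete system would presumably also expose the recognized words to the decoder so they can be copied into answers or captions; that aspect is omitted from this sketch.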

Cite

Text

Ganz et al. "Towards Models That Can See and Read." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.01985

Markdown

[Ganz et al. "Towards Models That Can See and Read." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/ganz2023iccv-models/) doi:10.1109/ICCV51070.2023.01985

BibTeX

@inproceedings{ganz2023iccv-models,
  title     = {{Towards Models That Can See and Read}},
  author    = {Ganz, Roy and Nuriel, Oren and Aberdam, Aviad and Kittenplon, Yair and Mazor, Shai and Litman, Ron},
  booktitle = {International Conference on Computer Vision},
  year      = {2023},
  pages     = {21718--21728},
  doi       = {10.1109/ICCV51070.2023.01985},
  url       = {https://mlanthology.org/iccv/2023/ganz2023iccv-models/}
}