Knowledge-Aware Image Understanding with Multi-Level Visual Representation Enhancement for Visual Question Answering

Abstract

Existing visual question answering (VQA) methods tend to focus excessively on the visual objects in an image while neglecting the implicit knowledge it contains, which limits their comprehension of image content. Furthermore, mainstream VQA methods rely on the bottom-up attention mechanism, which was first proposed in 2017 and has since become a bottleneck for visual question answering. To address these issues and improve image understanding, we make the following improvements and innovations: (1) We use an OCR model to detect and extract scene text from images, further enriching the understanding of image content, and we introduce image descriptions to enhance the model's comprehension of the images. (2) We improve the bottom-up attention model by extracting two sets of region features from each image and concatenating them into the final visual feature, which represents the image more effectively. (3) We design an extensible deep co-attention model composed of self-attention units and co-attention units. It can incorporate both image descriptions and scene text, and can be extended with other knowledge to further strengthen the model's reasoning ability. (4) Experimental results demonstrate that our best single model achieves an overall accuracy of 74.38% on the VQA 2.0 test set. To the best of our knowledge, without pretraining on external datasets, our model reaches a state-of-the-art level.
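The two visual-representation ideas in the abstract — concatenating two sets of region features into one visual feature, and passing the result through a self-attention unit — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the region count (36) and feature dimensions (2048 and 1024) are hypothetical, and the single-head scaled dot-product attention stands in for the paper's self-attention unit.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_region_features(feat_a, feat_b):
    """Concatenate two sets of region features along the channel axis
    to form the final visual feature (idea (2) in the abstract)."""
    assert feat_a.shape[0] == feat_b.shape[0], "region counts must match"
    return np.concatenate([feat_a, feat_b], axis=-1)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over region features,
    a stand-in for the self-attention unit in the co-attention model."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return scores @ v

rng = np.random.default_rng(0)
# Hypothetical: 36 regions, one 2048-d and one 1024-d feature per region.
feat_a = rng.standard_normal((36, 2048))
feat_b = rng.standard_normal((36, 1024))

fused = fuse_region_features(feat_a, feat_b)   # shape (36, 3072)

d = fused.shape[-1]
w_q = rng.standard_normal((d, d)) / np.sqrt(d)
w_k = rng.standard_normal((d, d)) / np.sqrt(d)
w_v = rng.standard_normal((d, d)) / np.sqrt(d)
attended = self_attention(fused, w_q, w_k, w_v)  # shape preserved: (36, 3072)
print(fused.shape, attended.shape)
```

The concatenation preserves the region count while widening the channel dimension, so downstream attention layers operate on a single joint representation per region.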

Cite

Text

Yan et al. "Knowledge-Aware Image Understanding with Multi-Level Visual Representation Enhancement for Visual Question Answering." Machine Learning, 2024. doi:10.1007/s10994-023-06426-6

Markdown

[Yan et al. "Knowledge-Aware Image Understanding with Multi-Level Visual Representation Enhancement for Visual Question Answering." Machine Learning, 2024.](https://mlanthology.org/mlj/2024/yan2024mlj-knowledgeaware/) doi:10.1007/s10994-023-06426-6

BibTeX

@article{yan2024mlj-knowledgeaware,
  title     = {{Knowledge-Aware Image Understanding with Multi-Level Visual Representation Enhancement for Visual Question Answering}},
  author    = {Yan, Feng and Li, Zhe and Silamu, Wushour and Li, Yanbing},
  journal   = {Machine Learning},
  year      = {2024},
  pages     = {3789--3805},
  doi       = {10.1007/s10994-023-06426-6},
  volume    = {113},
  url       = {https://mlanthology.org/mlj/2024/yan2024mlj-knowledgeaware/}
}