SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning
Abstract
Answering complex questions about images is an ambitious goal for machine intelligence, which requires a joint understanding of images, text, and commonsense knowledge, as well as strong reasoning ability. Recently, multimodal Transformers have made great progress in the task of Visual Commonsense Reasoning (VCR) by jointly understanding visual objects and text tokens through layers of cross-modality attention. However, these approaches do not utilize the rich structure of the scene and the interactions between objects, which are essential in answering complex commonsense questions. We propose a Scene Graph Enhanced Image-Text Learning (SGEITL) framework to incorporate visual scene graphs into commonsense reasoning. To exploit the scene graph structure, at the model architecture level we propose a multi-hop graph transformer that regularizes attention interaction among hops. For pre-training, we propose a scene-graph-aware pre-training method that leverages the structural knowledge extracted from the visual scene graph. Moreover, we introduce a method to train and generate domain-relevant visual scene graphs using textual annotations in a weakly supervised manner. Extensive experiments on VCR and other tasks show a significant performance boost compared with state-of-the-art methods and demonstrate the efficacy of each proposed component.
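The abstract's "multi-hop graph transformer" regularizes attention between scene-graph nodes according to how many hops apart they are. The sketch below is a hypothetical illustration of that general idea, not the authors' implementation: it computes all-pairs hop distances over a scene-graph adjacency matrix with BFS, then masks attention scores so each object only attends to objects within `max_hops` of it.

```python
# Hypothetical sketch of hop-limited attention over a scene graph.
# Not the SGEITL implementation; function names and the max_hops
# parameter are illustrative assumptions.
from collections import deque

import numpy as np


def hop_distances(adj: np.ndarray) -> np.ndarray:
    """All-pairs hop distance via BFS; unreachable pairs get np.inf."""
    n = adj.shape[0]
    dist = np.full((n, n), np.inf)
    for src in range(n):
        dist[src, src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in np.nonzero(adj[u])[0]:
                if dist[src, v] == np.inf:
                    dist[src, v] = dist[src, u] + 1
                    queue.append(v)
    return dist


def hop_masked_attention(scores: np.ndarray, adj: np.ndarray,
                         max_hops: int = 2) -> np.ndarray:
    """Disallow attention between nodes more than max_hops apart,
    then softmax-normalize over the remaining positions."""
    dist = hop_distances(adj)
    masked = np.where(dist <= max_hops, scores, -np.inf)
    exp = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)


# Toy scene graph: four objects in a chain 0-1-2-3.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])
attn = hop_masked_attention(np.zeros((4, 4)), adj, max_hops=2)
```

With uniform (all-zero) scores and `max_hops=2`, object 0 spreads its attention evenly over objects 0, 1, and 2 and gives zero weight to object 3, which is three hops away.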
Cite
Text
Wang et al. "SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning." AAAI Conference on Artificial Intelligence, 2022. doi:10.1609/AAAI.V36I5.20536

Markdown
[Wang et al. "SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning." AAAI Conference on Artificial Intelligence, 2022.](https://mlanthology.org/aaai/2022/wang2022aaai-sgeitl/) doi:10.1609/AAAI.V36I5.20536

BibTeX
@inproceedings{wang2022aaai-sgeitl,
title = {{SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning}},
author = {Wang, Zhecan and You, Haoxuan and Li, Liunian Harold and Zareian, Alireza and Park, Suji and Liang, Yiqing and Chang, Kai-Wei and Chang, Shih-Fu},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2022},
pages = {5914--5922},
doi = {10.1609/AAAI.V36I5.20536},
url = {https://mlanthology.org/aaai/2022/wang2022aaai-sgeitl/}
}