Global Fusion Attention for Vision and Language Understanding (Student Abstract)

Abstract

We extend the popular Transformer architecture to a multi-modal model that processes both visual and textual inputs. We propose a new attention mechanism on the Transformer-based architecture for joint vision and language understanding tasks. Our model fuses multi-level comprehension between images and texts in a weighted manner, which better captures their internal relationships. Experiments on the benchmark VQA dataset CLEVR demonstrate the effectiveness of the proposed attention mechanism. We also observe improvements in the sample efficiency of reinforcement learning in experiments on grounded language understanding tasks from the BabyAI platform.
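To make the idea concrete, below is a minimal PyTorch sketch of a weighted cross-modal fusion attention block in the spirit the abstract describes. It is an illustrative assumption, not the authors' implementation: the class name GlobalFusionAttention, the per-level cross-attention layers, and the learned fusion_weights are all hypothetical choices.

# Hypothetical sketch of a weighted cross-modal fusion attention block.
# Names (GlobalFusionAttention, fusion_weights, num_levels) are illustrative
# assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn


class GlobalFusionAttention(nn.Module):
    """Fuses visual and textual token sequences with cross-attention,
    combining several "levels" of comprehension via learned weights."""

    def __init__(self, dim: int, num_heads: int = 8, num_levels: int = 2):
        super().__init__()
        # One cross-attention module per comprehension level (assumption).
        self.text_to_image = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True)
             for _ in range(num_levels)]
        )
        self.image_to_text = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True)
             for _ in range(num_levels)]
        )
        # Learned scalar weights for the weighted fusion across levels.
        self.fusion_weights = nn.Parameter(torch.ones(num_levels))
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats, image_feats):
        # text_feats: (B, T, dim); image_feats: (B, I, dim)
        weights = torch.softmax(self.fusion_weights, dim=0)
        fused_text, fused_image = 0.0, 0.0
        for w, t2i, i2t in zip(weights, self.text_to_image, self.image_to_text):
            # Text queries attend over image keys/values, and vice versa.
            t_out, _ = t2i(text_feats, image_feats, image_feats)
            i_out, _ = i2t(image_feats, text_feats, text_feats)
            fused_text = fused_text + w * t_out
            fused_image = fused_image + w * i_out
        # Residual connection plus normalization, as is standard in Transformers.
        return self.norm(text_feats + fused_text), self.norm(image_feats + fused_image)


if __name__ == "__main__":
    text = torch.randn(2, 16, 256)   # e.g. question tokens
    image = torch.randn(2, 49, 256)  # e.g. 7x7 grid of image features
    block = GlobalFusionAttention(dim=256)
    t, i = block(text, image)
    print(t.shape, i.shape)  # torch.Size([2, 16, 256]) torch.Size([2, 49, 256])

The softmax over fusion_weights is one plausible way to realize the "weighted manner" of fusing levels; the paper may use a different weighting or gating scheme.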

Cite

Text

Guo et al. "Global Fusion Attention for Vision and Language Understanding (Student Abstract)." AAAI Conference on Artificial Intelligence, 2021. doi:10.1609/AAAI.V35I18.17891

Markdown

[Guo et al. "Global Fusion Attention for Vision and Language Understanding (Student Abstract)." AAAI Conference on Artificial Intelligence, 2021.](https://mlanthology.org/aaai/2021/guo2021aaai-global/) doi:10.1609/AAAI.V35I18.17891

BibTeX

@inproceedings{guo2021aaai-global,
  title     = {{Global Fusion Attention for Vision and Language Understanding (Student Abstract)}},
  author    = {Guo, Zixin and Liang, Chen and Wan, Ziyu and Bai, Yang},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2021},
  pages     = {15789--15790},
  doi       = {10.1609/AAAI.V35I18.17891},
  url       = {https://mlanthology.org/aaai/2021/guo2021aaai-global/}
}