Auto-Parsing Network for Image Captioning and Visual Question Answering

Xu Yang, Chongyang Gao, Hanwang Zhang, Jianfei Cai

ICCV 2021 pp. 2197-2207

doi:10.1109/ICCV48922.2021.00220

Abstract

We propose an Auto-Parsing Network (APN) that discovers and exploits the hidden tree structures of the input data to improve Transformer-based vision-language systems. Specifically, we impose a Probabilistic Graphical Model (PGM), parameterized by the attention operations, on each self-attention layer to incorporate a sparsity assumption. We use this PGM to softly segment an input sequence into a few clusters, where each cluster can be treated as the parent of the entities inside it. By stacking these PGM-constrained self-attention layers, the clusters in a lower layer compose into a new sequence, which the PGM in a higher layer further segments. Iteratively, a sparse tree is implicitly parsed, and its hierarchical knowledge is incorporated into the transformed embeddings, which can then be used to solve the target vision-language tasks. We show that APN can strengthen Transformer-based networks on two major vision-language tasks: Captioning and Visual Question Answering. We also develop a PGM probability-based parsing algorithm that reveals the hidden structure of the input during inference.
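As a rough illustration of the soft segmentation idea described in the abstract, the sketch below shows one way a self-attention layer could be constrained by cluster probabilities. It is not the authors' implementation: the abstract does not specify the PGM, so the parameterization here (a hypothetical list `link_probs` giving the probability that adjacent tokens share a cluster) and the function names `soft_segment_mask` and `constrained_attention` are assumptions made for illustration only.

```python
import math


def soft_segment_mask(link_probs):
    """Build an n x n soft cluster mask from adjacent-link probabilities.

    Assumption (not from the paper): link_probs[i] is the probability that
    tokens i and i+1 belong to the same cluster. mask[i][j] is then the
    product of the link probabilities on the path between i and j.
    """
    n = len(link_probs) + 1
    mask = [[1.0] * n for _ in range(n)]
    for i in range(n):
        prod = 1.0
        for j in range(i + 1, n):
            prod *= link_probs[j - 1]
            mask[i][j] = prod
            mask[j][i] = prod  # the mask is symmetric
    return mask


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def constrained_attention(scores, link_probs):
    """Scale softmax attention by the soft mask, then renormalize each row."""
    mask = soft_segment_mask(link_probs)
    attn = []
    for score_row, mask_row in zip(scores, mask):
        weights = [w * m for w, m in zip(softmax(score_row), mask_row)]
        total = sum(weights)  # > 0 because mask[i][i] == 1
        attn.append([w / total for w in weights])
    return attn


# Toy input: four tokens with a weak link between tokens 1 and 2,
# so the soft clusters are roughly {0, 1} and {2, 3}.
link_probs = [0.9, 0.1, 0.9]
scores = [[0.0] * 4 for _ in range(4)]  # uniform raw attention scores
mask = soft_segment_mask(link_probs)
attn = constrained_attention(scores, link_probs)
```

With uniform raw scores, each token in this toy example attends mostly to tokens inside its own soft cluster. Stacking such layers, with higher layers using larger clusters, is one way to picture the hierarchical parsing the abstract describes.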

Cite

Text

Yang et al. "Auto-Parsing Network for Image Captioning and Visual Question Answering." International Conference on Computer Vision, 2021. doi:10.1109/ICCV48922.2021.00220

Markdown

[Yang et al. "Auto-Parsing Network for Image Captioning and Visual Question Answering." International Conference on Computer Vision, 2021.](https://mlanthology.org/iccv/2021/yang2021iccv-autoparsing/) doi:10.1109/ICCV48922.2021.00220

BibTeX

@inproceedings{yang2021iccv-autoparsing,
  title     = {{Auto-Parsing Network for Image Captioning and Visual Question Answering}},
  author    = {Yang, Xu and Gao, Chongyang and Zhang, Hanwang and Cai, Jianfei},
  booktitle = {International Conference on Computer Vision},
  year      = {2021},
  pages     = {2197-2207},
  doi       = {10.1109/ICCV48922.2021.00220},
  url       = {https://mlanthology.org/iccv/2021/yang2021iccv-autoparsing/}
}