Unsupervised Vision-Language Grammar Induction with Shared Structure Modeling

Wan, Bo; Han, Wenjuan; Zheng, Zilong; Tuytelaars, Tinne

ICLR 2022

Abstract

We introduce a new task, unsupervised vision-language (VL) grammar induction. Given an image-caption pair, the goal is to extract a shared hierarchical structure for both image and language simultaneously. We argue that such structured output, grounded in both modalities, is a clear step towards the high-level understanding of multimodal information. Beyond the challenges of conventional visually grounded grammar induction tasks, VL grammar induction requires a model to capture contextual semantics and perform a fine-grained alignment. To address these challenges, we propose a novel method, CLIORA, which constructs a shared vision-language constituency tree structure with context-dependent semantics for all possible phrases at different levels of the tree. It computes a matching score between each constituent and image region, trained via contrastive learning. It integrates two levels of fusion, namely at the feature level and at the score level, so as to allow fine-grained alignment. We introduce a new evaluation metric for VL grammar induction, CCRA, and show a 3.3% improvement over a strong baseline on Flickr30k Entities. We also evaluate our model via two derived tasks, i.e., language grammar induction and phrase grounding, and improve over the state-of-the-art for both.
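The abstract mentions a matching score between each constituent and image region, trained via contrastive learning. The sketch below is one generic way such an objective could look and is not taken from the paper. It assumes that constituents and regions are already embedded as plain vectors, that the score is cosine similarity, that a caption-image score averages each constituent's best-matching region, and that training uses a margin-based ranking loss over mismatched pairs in a batch. The names `cosine`, `pair_score`, `contrastive_loss`, and the `margin` parameter are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pair_score(constituents, regions):
    """Caption-image score (assumed form): average over constituents
    of the best cosine match among the image regions."""
    best = [max(cosine(c, r) for r in regions) for c in constituents]
    return sum(best) / len(best)


def contrastive_loss(batch, margin=0.2):
    """Margin-based ranking loss over a batch of (constituents, regions) pairs.

    Each matched pair should outscore, by at least `margin`, both the same
    caption with another image and another caption with the same image.
    This is an illustrative assumption, not the paper's objective.
    """
    total = 0.0
    for i, (spans_i, regions_i) in enumerate(batch):
        positive = pair_score(spans_i, regions_i)
        for j, (spans_j, regions_j) in enumerate(batch):
            if i == j:
                continue
            # Same caption, mismatched image.
            total += max(0.0, margin - positive + pair_score(spans_i, regions_j))
            # Mismatched caption, same image.
            total += max(0.0, margin - positive + pair_score(spans_j, regions_i))
    return total / len(batch)


# Toy data: two captions with one constituent each, two images with one region each.
aligned = [([[1.0, 0.0]], [[1.0, 0.0]]), ([[0.0, 1.0]], [[0.0, 1.0]])]
swapped = [([[1.0, 0.0]], [[0.0, 1.0]]), ([[0.0, 1.0]], [[1.0, 0.0]])]
```

On this toy data the loss for `aligned` is zero because every matched pair beats its mismatches by more than the margin, while `swapped` is penalized. A real model would learn the embeddings and compute scores for all candidate spans in the tree rather than for a fixed list.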

Cite

Text

Wan et al. "Unsupervised Vision-Language Grammar Induction with Shared Structure Modeling." International Conference on Learning Representations, 2022.

Markdown

[Wan et al. "Unsupervised Vision-Language Grammar Induction with Shared Structure Modeling." International Conference on Learning Representations, 2022.](https://mlanthology.org/iclr/2022/wan2022iclr-unsupervised/)

BibTeX

@inproceedings{wan2022iclr-unsupervised,
  title     = {{Unsupervised Vision-Language Grammar Induction with Shared Structure Modeling}},
  author    = {Wan, Bo and Han, Wenjuan and Zheng, Zilong and Tuytelaars, Tinne},
  booktitle = {International Conference on Learning Representations},
  year      = {2022},
  url       = {https://mlanthology.org/iclr/2022/wan2022iclr-unsupervised/}
}