Referring Transformer: A One-Step Approach to Multi-Task Visual Grounding
Abstract
As an important step towards visual reasoning, visual grounding (e.g., phrase localization, referring expression comprehension / segmentation) has been widely explored. Previous approaches to referring expression comprehension (REC) or segmentation (RES) either suffer from limited performance due to a two-stage setup or require the design of complex, task-specific one-stage architectures. In this paper, we propose a simple one-stage multi-task framework for visual grounding tasks. Specifically, we leverage a transformer architecture in which the two modalities are fused in a visual-lingual encoder. In the decoder, the model learns to generate contextualized lingual queries, which are then decoded and used to directly regress the bounding box and produce a segmentation mask for the corresponding referred regions. With this simple but highly contextualized model, we outperform state-of-the-art methods by a large margin on both REC and RES tasks. We also show that a simple pre-training schedule (on an external dataset) further improves performance. Extensive experiments and ablations illustrate that our model benefits greatly from contextualized information and multi-task training.
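The encoder-decoder design described in the abstract can be sketched in code. The PyTorch snippet below is a minimal illustration of the general idea only, not the authors' implementation: every module name, dimension, and head design here is an assumption (e.g., a single pooled lingual query, projected CNN features, and a low-resolution 16x16 mask head).

```python
# A minimal sketch of a one-stage visual-lingual encoder-decoder for grounding.
# All names, dimensions, and design details are assumptions for illustration,
# not the paper's actual architecture. Requires PyTorch >= 1.9 (batch_first).
import torch
import torch.nn as nn

class ReferringGroundingSketch(nn.Module):
    def __init__(self, d_model=256, nhead=8, num_layers=3, vocab_size=10000):
        super().__init__()
        # Project visual features (e.g., flattened CNN backbone features) and
        # word embeddings into a shared d_model-dimensional space.
        self.visual_proj = nn.Linear(2048, d_model)
        self.word_embed = nn.Embedding(vocab_size, d_model)
        # Visual-lingual encoder: fuses the two modalities with self-attention.
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        # Decoder: a contextualized lingual query attends to the fused memory.
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        # Task heads: box regression (x, y, w, h) and a coarse segmentation mask.
        self.box_head = nn.Linear(d_model, 4)
        self.mask_head = nn.Linear(d_model, 16 * 16)  # low-res mask logits

    def forward(self, visual_feats, word_ids):
        # visual_feats: (B, Nv, 2048) flattened image features
        # word_ids:     (B, Nw) token ids of the referring expression
        v = self.visual_proj(visual_feats)
        w = self.word_embed(word_ids)
        memory = self.encoder(torch.cat([v, w], dim=1))
        # Pool the lingual tokens into one query summarizing the expression.
        query = w.mean(dim=1, keepdim=True)            # (B, 1, d_model)
        out = self.decoder(query, memory)              # (B, 1, d_model)
        box = self.box_head(out).sigmoid().squeeze(1)  # normalized (x, y, w, h)
        mask = self.mask_head(out).view(-1, 16, 16)    # coarse mask logits
        return box, mask

# Example usage with random inputs:
model = ReferringGroundingSketch()
boxes, masks = model(torch.randn(2, 49, 2048), torch.randint(0, 10000, (2, 12)))
print(boxes.shape, masks.shape)  # torch.Size([2, 4]) torch.Size([2, 16, 16])
```

The key point the sketch conveys is that both the box and the mask are predicted in a single forward pass from the same decoded query, rather than by a separate two-stage proposal-then-rank pipeline.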
Cite
Text
Li and Sigal. "Referring Transformer: A One-Step Approach to Multi-Task Visual Grounding." Neural Information Processing Systems, 2021.
Markdown
[Li and Sigal. "Referring Transformer: A One-Step Approach to Multi-Task Visual Grounding." Neural Information Processing Systems, 2021.](https://mlanthology.org/neurips/2021/li2021neurips-referring/)
BibTeX
@inproceedings{li2021neurips-referring,
  title     = {{Referring Transformer: A One-Step Approach to Multi-Task Visual Grounding}},
  author    = {Li, Muchen and Sigal, Leonid},
  booktitle = {Neural Information Processing Systems},
  year      = {2021},
  url       = {https://mlanthology.org/neurips/2021/li2021neurips-referring/}
}