12-in-1: Multi-Task Vision and Language Representation Learning

Lu, Jiasen; Goswami, Vedanuj; Rohrbach, Marcus; Parikh, Devi; Lee, Stefan

doi:10.1109/CVPR42600.2020.01045

CVPR 2020

Abstract

Much of vision-and-language research focuses on a small but diverse set of independent tasks and supporting datasets often studied in isolation; however, the visually-grounded language understanding skills required for success at these tasks overlap significantly. In this work, we investigate these relationships between vision-and-language tasks by developing a large-scale, multi-task model. Our approach culminates in a single model trained on 12 datasets from four broad categories of task: visual question answering, caption-based image retrieval, grounding referring expressions, and multimodal verification. Compared to independently trained single-task models, this represents a reduction from approximately 3 billion parameters to 270 million while simultaneously improving performance by 2.05 points on average across tasks. We use our multi-task framework to perform in-depth analysis of the effect of jointly training diverse tasks. Further, we show that finetuning task-specific models from our single multi-task model can lead to further improvements, achieving performance at or above the state-of-the-art.

Cite

Text

Lu et al. "12-in-1: Multi-Task Vision and Language Representation Learning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020. doi:10.1109/CVPR42600.2020.01045

Markdown

[Lu et al. "12-in-1: Multi-Task Vision and Language Representation Learning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.](https://mlanthology.org/cvpr/2020/lu2020cvpr-12in1/) doi:10.1109/CVPR42600.2020.01045

BibTeX

@inproceedings{lu2020cvpr-12in1,
  title     = {{12-in-1: Multi-Task Vision and Language Representation Learning}},
  author    = {Lu, Jiasen and Goswami, Vedanuj and Rohrbach, Marcus and Parikh, Devi and Lee, Stefan},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2020},
  doi       = {10.1109/CVPR42600.2020.01045},
  url       = {https://mlanthology.org/cvpr/2020/lu2020cvpr-12in1/}
}