UniT: Multimodal Multitask Learning with a Unified Transformer

Abstract

We propose UniT, a Unified Transformer model to simultaneously learn the most prominent tasks across different domains, ranging from object detection to natural language understanding and multimodal reasoning. Based on the transformer encoder-decoder architecture, our UniT model encodes each input modality with an encoder and makes predictions on each task with a shared decoder over the encoded input representations, followed by task-specific output heads. The entire model is jointly trained end-to-end with losses from each task. Compared to previous efforts on multi-task learning with transformers, we share the same model parameters across all tasks instead of separately fine-tuning task-specific models and handle a much higher variety of tasks across different domains. In our experiments, we learn 7 tasks jointly over 8 datasets, achieving strong performance on each task with significantly fewer parameters. Our code is available in MMF at https://mmf.sh.
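To make the architecture described above concrete, here is a minimal sketch of the core idea: one encoder per input modality, a single shared transformer decoder over the concatenated encoded inputs, and task-specific output heads. This is not the authors' MMF implementation; the class name (UniTSketch), feature dimensions, task names, and head sizes are illustrative assumptions, and the image encoder here operates on pre-extracted features rather than the convolutional backbone used in the paper.

# Illustrative sketch only: names, dimensions, and heads are assumptions,
# not the official UniT/MMF implementation.
import torch
from torch import nn


class UniTSketch(nn.Module):
    def __init__(self, d_model=256, nhead=8, num_layers=2,
                 vocab_size=30522, num_queries=16, task_output_dims=None):
        super().__init__()
        # Hypothetical per-task output sizes (e.g. VQA answer vocabulary,
        # detection classes); replace with the tasks you actually train.
        task_output_dims = task_output_dims or {"vqa": 3129, "detection": 81}
        # One encoder per modality: image features and text tokens.
        self.image_proj = nn.Linear(2048, d_model)        # project region/grid features
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.image_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers)
        # Shared decoder: the same parameters serve every task.
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers)
        # Learned task-specific query embeddings and output heads.
        self.task_queries = nn.ParameterDict({
            t: nn.Parameter(torch.randn(num_queries, d_model)) for t in task_output_dims})
        self.task_heads = nn.ModuleDict({
            t: nn.Linear(d_model, dim) for t, dim in task_output_dims.items()})

    def forward(self, task, image_feats=None, text_ids=None):
        encoded = []
        if image_feats is not None:                       # (B, N_img, 2048)
            encoded.append(self.image_encoder(self.image_proj(image_feats)))
        if text_ids is not None:                          # (B, N_txt)
            encoded.append(self.text_encoder(self.text_embed(text_ids)))
        memory = torch.cat(encoded, dim=1)                # concatenate encoded modalities
        queries = self.task_queries[task].unsqueeze(0).expand(memory.size(0), -1, -1)
        hidden = self.decoder(queries, memory)            # shared decoder over all tasks
        return self.task_heads[task](hidden)              # task-specific head

if __name__ == "__main__":
    model = UniTSketch()
    img = torch.randn(2, 36, 2048)                        # 36 hypothetical region features
    txt = torch.randint(0, 30522, (2, 20))                # hypothetical token ids
    print(model("vqa", image_feats=img, text_ids=txt).shape)   # torch.Size([2, 16, 3129])

In joint training, batches from each task's dataset would be interleaved and the corresponding per-task loss applied to that task's head, with gradients updating the shared encoder and decoder parameters end-to-end.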

Cite

Text

Hu and Singh. "UniT: Multimodal Multitask Learning with a Unified Transformer." International Conference on Computer Vision, 2021. doi:10.1109/ICCV48922.2021.00147

Markdown

[Hu and Singh. "UniT: Multimodal Multitask Learning with a Unified Transformer." International Conference on Computer Vision, 2021.](https://mlanthology.org/iccv/2021/hu2021iccv-unit/) doi:10.1109/ICCV48922.2021.00147

BibTeX

@inproceedings{hu2021iccv-unit,
  title     = {{UniT: Multimodal Multitask Learning with a Unified Transformer}},
  author    = {Hu, Ronghang and Singh, Amanpreet},
  booktitle = {International Conference on Computer Vision},
  year      = {2021},
  pages     = {1439--1449},
  doi       = {10.1109/ICCV48922.2021.00147},
  url       = {https://mlanthology.org/iccv/2021/hu2021iccv-unit/}
}