VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment

Abstract

Vision-language pre-training (VLP) has recently proven highly effective for various uni- and multi-modal downstream applications. However, most existing end-to-end VLP methods rely on high-resolution image-text-box data to perform well on fine-grained region-level tasks, such as object detection, segmentation, and referring expression comprehension. Unfortunately, such high-resolution images with accurate bounding box annotations are expensive to collect and use for supervision at scale. In this work, we propose VoLTA (Vision Language Transformer with weakly-supervised local-feature Alignment), a new VLP paradigm that utilizes only image-caption data yet achieves fine-grained region-level image understanding, eliminating the need for expensive box annotations. VoLTA adopts graph optimal transport-based weakly-supervised alignment between local image patches and text tokens to germinate an explicit, self-normalized, and interpretable low-level matching criterion. In addition, VoLTA pushes multi-modal fusion deep into the uni-modal backbones during pre-training and removes fusion-specific transformer layers, further reducing memory requirements. Extensive experiments on a wide range of vision- and vision-language downstream tasks demonstrate the effectiveness of VoLTA on fine-grained applications without compromising coarse-grained downstream performance, often outperforming methods that use significantly more caption and box annotations.
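The fine-grained objective the abstract describes, graph optimal transport (GOT), couples a Wasserstein distance on node (patch/token) features with a Gromov-Wasserstein distance on edge structure. As a rough illustration only, the sketch below computes the entropic Wasserstein half via Sinkhorn iterations between patch and token embeddings; the feature shapes, cosine cost, and hyperparameters are illustrative assumptions, not the paper's exact configuration.

import torch

def sinkhorn(cost, n_iters=50, eps=0.05):
    """Entropic-regularized OT: returns a self-normalized transport plan.

    cost: (n, m) pairwise cost between image patches and text tokens.
    """
    n, m = cost.shape
    # Uniform marginals over patches and tokens.
    a = torch.full((n,), 1.0 / n)
    b = torch.full((m,), 1.0 / m)
    K = torch.exp(-cost / eps)                  # Gibbs kernel
    u = torch.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.t() @ u)                     # column scaling
        u = a / (K @ v)                         # row scaling
    return u.unsqueeze(1) * K * v.unsqueeze(0)  # transport plan T

# Hypothetical local features: 49 patch embeddings vs. 12 token embeddings.
patches = torch.nn.functional.normalize(torch.randn(49, 256), dim=-1)
tokens = torch.nn.functional.normalize(torch.randn(12, 256), dim=-1)
cost = 1.0 - patches @ tokens.t()               # cosine cost
T = sinkhorn(cost)
wasserstein_loss = (T * cost).sum()             # alignment loss to minimize

The Sinkhorn iterations drive the rows and columns of T toward the uniform marginals, which is what makes the patch-token matching self-normalized and interpretable without any box supervision.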

Cite

Text

Pramanick et al. "VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment." Transactions on Machine Learning Research, 2023.

Markdown

[Pramanick et al. "VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment." Transactions on Machine Learning Research, 2023.](https://mlanthology.org/tmlr/2023/pramanick2023tmlr-volta/)

BibTeX

@article{pramanick2023tmlr-volta,
  title     = {{VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment}},
  author    = {Pramanick, Shraman and Jing, Li and Nag, Sayan and Zhu, Jiachen and Shah, Hardik J. and LeCun, Yann and Chellappa, Rama},
  journal   = {Transactions on Machine Learning Research},
  year      = {2023},
  url       = {https://mlanthology.org/tmlr/2023/pramanick2023tmlr-volta/}
}