VisualGPT: Data-Efficient Adaptation of Pretrained Language Models for Image Captioning

Abstract

The limited availability of annotated data often hinders real-world applications of machine learning. To efficiently learn from small quantities of multimodal data, we leverage the linguistic knowledge from a large pretrained language model (PLM) and quickly adapt it to new domains of image captioning. To effectively utilize a pretrained model, it is critical to balance the visual input and prior linguistic knowledge from pretraining. We propose VisualGPT, which employs a novel self-resurrecting encoder-decoder attention mechanism to quickly adapt the PLM with a small amount of in-domain image-text data. The proposed self-resurrecting activation unit produces sparse activations that prevent accidental overwriting of linguistic knowledge. When trained on 0.1%, 0.5% and 1% of the respective training sets, VisualGPT surpasses the best baseline by up to 10.0% CIDEr on MS COCO and 17.9% CIDEr on Conceptual Captions. Furthermore, VisualGPT achieves the state-of-the-art result on IU X-ray, a medical report generation dataset. Our code is available at https://github.com/Vision-CAIR/VisualGPT.
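For intuition, below is a minimal PyTorch sketch of the sparse, complementary gating idea described in the abstract: cross-attention over visual features and the language model's hidden states are fused through thresholded sigmoid gates, so weak activations are zeroed rather than allowed to slowly erode the pretrained linguistic representation. The module name, the linear projection used to compute the gate, and the threshold value are illustrative assumptions, not the authors' reference implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn


class SelfResurrectingGate(nn.Module):
    """Illustrative sketch of a sparse, complementary gating unit.

    Fuses the cross-attention output over visual features with the
    pretrained language model's hidden state. Gate values at or below a
    threshold tau are zeroed, so the gates are sparse and complementary
    (sigmoid(h) for the visual path, 1 - sigmoid(h) for the language path).
    Hyperparameters and layer choices here are assumptions for illustration.
    """

    def __init__(self, d_model: int, tau: float = 0.2):
        super().__init__()
        # Hypothetical choice: compute the gate from both streams concatenated.
        self.proj = nn.Linear(2 * d_model, d_model)
        self.tau = tau

    def forward(self, visual_attn: torch.Tensor, lang_hidden: torch.Tensor) -> torch.Tensor:
        # visual_attn, lang_hidden: (batch, seq_len, d_model)
        gate_logits = self.proj(torch.cat([visual_attn, lang_hidden], dim=-1))
        g = torch.sigmoid(gate_logits)

        # Zero out small activations so each gate is exactly 0 below tau;
        # a gate can switch off completely and later "resurrect", instead of
        # lingering at small nonzero values that overwrite linguistic knowledge.
        b_vis = g * (g > self.tau).float()
        b_lan = (1.0 - g) * ((1.0 - g) > self.tau).float()

        return b_vis * visual_attn + b_lan * lang_hidden


if __name__ == "__main__":
    gate = SelfResurrectingGate(d_model=768, tau=0.2)
    vis = torch.randn(2, 20, 768)   # e.g. cross-attention over image regions
    lan = torch.randn(2, 20, 768)   # e.g. pretrained LM hidden states
    out = gate(vis, lan)
    print(out.shape)                # torch.Size([2, 20, 768])
```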

Cite

Text

Chen et al. "VisualGPT: Data-Efficient Adaptation of Pretrained Language Models for Image Captioning." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.01750

Markdown

[Chen et al. "VisualGPT: Data-Efficient Adaptation of Pretrained Language Models for Image Captioning." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/chen2022cvpr-visualgpt/) doi:10.1109/CVPR52688.2022.01750

BibTeX

@inproceedings{chen2022cvpr-visualgpt,
  title     = {{VisualGPT: Data-Efficient Adaptation of Pretrained Language Models for Image Captioning}},
  author    = {Chen, Jun and Guo, Han and Yi, Kai and Li, Boyang and Elhoseiny, Mohamed},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2022},
  pages     = {18030--18040},
  doi       = {10.1109/CVPR52688.2022.01750},
  url       = {https://mlanthology.org/cvpr/2022/chen2022cvpr-visualgpt/}
}