Generative Self-Training Improves Pre-Training for Visual Dialog

Abstract

Visual dialog (VisDial) is the task of answering a series of questions grounded in an image, using the dialog history as context. Prior work has either trained dialog models solely on VisDial data via supervised learning or leveraged pre-training on related vision-and-language datasets. This paper presents a semi-supervised learning approach for VisDial, called Generative Self-Training (GST), that enhances such pre-training. Specifically, GST generates synthetic dialog data for unlabeled images via multimodal conditional text generation and trains the dialog model on both the synthetic and the original VisDial data. We also propose perplexity-based data selection and multimodal consistency regularization for robust training on the synthetic data. Evaluation on the VisDial v1.0 dataset shows that GST improves pre-training and achieves new state-of-the-art results.
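
The abstract outlines a self-training loop: a teacher model generates synthetic dialogs for unlabeled images, low-confidence generations are filtered out by perplexity, and the student dialog model is trained on the filtered synthetic data together with the original VisDial data. Below is a minimal, self-contained Python sketch of that loop under stated assumptions; `generate_synthetic_dialog`, `train_student`, and the perplexity threshold are hypothetical stand-ins for illustration, not the authors' implementation.

```python
import math
import random

def generate_synthetic_dialog(image):
    # Hypothetical stand-in for the teacher: given an unlabeled image, it
    # returns a synthetic answer plus per-token log-probabilities. The real
    # GST teacher is a multimodal conditional text generator trained on VisDial.
    tokens = ["a", "red", "bus"]                             # fake generated answer
    log_probs = [random.uniform(-3.0, -0.1) for _ in tokens]
    return {"image": image, "dialog": tokens, "log_probs": log_probs}

def perplexity(log_probs):
    # PPL = exp(-mean token log-likelihood); lower means the teacher was
    # more confident when generating this dialog.
    return math.exp(-sum(log_probs) / len(log_probs))

def select_by_perplexity(samples, threshold):
    # Perplexity-based data selection: keep only synthetic dialogs the
    # teacher generated with high confidence (low perplexity).
    return [s for s in samples if perplexity(s["log_probs"]) < threshold]

def train_student(labeled, synthetic):
    # Placeholder for training the dialog model on the union of the original
    # VisDial data and the filtered synthetic dialogs (the paper additionally
    # applies multimodal consistency regularization on the synthetic part).
    print(f"training on {len(labeled)} labeled + {len(synthetic)} synthetic dialogs")

if __name__ == "__main__":
    unlabeled_images = [f"img_{i}.jpg" for i in range(100)]
    visdial_data = ["original VisDial v1.0 dialogs..."]      # stands in for the train split
    synthetic = [generate_synthetic_dialog(img) for img in unlabeled_images]
    kept = select_by_perplexity(synthetic, threshold=5.0)    # threshold is illustrative
    train_student(visdial_data, kept)
```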

Cite

Text

Kang et al. "Generative Self-Training Improves Pre-Training for Visual Dialog." ICML 2022 Workshops: Pre-Training, 2022.

Markdown

[Kang et al. "Generative Self-Training Improves Pre-Training for Visual Dialog." ICML 2022 Workshops: Pre-Training, 2022.](https://mlanthology.org/icmlw/2022/kang2022icmlw-generative/)

BibTeX

@inproceedings{kang2022icmlw-generative,
  title     = {{Generative Self-Training Improves Pre-Training for Visual Dialog}},
  author    = {Kang, Gi-Cheon and Kim, Sungdong and Kim, Jin-Hwa and Kwak, Donghyun and Zhang, Byoung-Tak},
  booktitle = {ICML 2022 Workshops: Pre-Training},
  year      = {2022},
  url       = {https://mlanthology.org/icmlw/2022/kang2022icmlw-generative/}
}