Towards Diverse and Natural Image Descriptions via a Conditional GAN

Abstract

Despite substantial progress in recent years, the problem of image captioning remains far from being satisfactorily tackled. Sentences produced by existing methods, e.g. those based on LSTM, are often overly rigid and lacking in variability. This issue is related to a learning principle widely used in practice, that is, to maximize the likelihood of training samples. This principle encourages high resemblance to the "ground-truths", while suppressing other reasonable expressions. Conventional evaluation metrics, e.g. BLEU and METEOR, also favor such restrictive methods. In this paper, we explore an alternative approach, with the aim of improving naturalness and diversity, two essential properties of human expression. Specifically, we propose a new framework based on Conditional Generative Adversarial Networks (CGAN), which jointly learns a generator to produce descriptions conditioned on images and an evaluator to assess how well a description fits the visual content. It is noteworthy that training a sequence generator is nontrivial. We overcome this difficulty with Policy Gradient, a strategy stemming from Reinforcement Learning, which allows the generator to receive early feedback along the way. We tested our method on two large datasets, where it performed competitively against real people in our user study and outperformed other methods on various tasks.
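To make the Policy Gradient idea concrete, the following is a minimal toy sketch of a score-function (REINFORCE-style) update, not the paper's implementation. Everything in it is an assumption made for illustration: the generator is reduced to a context-free softmax over a tiny vocabulary (no image conditioning, no LSTM), `toy_evaluator` is a hypothetical stand-in for the learned evaluator, and the reward is given once per whole sequence rather than as the early, partial-sequence feedback the abstract mentions.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_sequence(logits, length, rng):
    """Sample token ids from a context-free softmax policy (toy generator)."""
    probs = softmax(logits)
    ids = range(len(probs))
    return [rng.choices(ids, weights=probs)[0] for _ in range(length)]


def toy_evaluator(tokens, preferred=2):
    """Hypothetical evaluator: fraction of tokens equal to a preferred id."""
    return sum(1 for t in tokens if t == preferred) / len(tokens)


def reinforce_gradient(logits, tokens, reward):
    """Return reward * d/d(logits) of sum_t log pi(token_t).

    For a softmax policy, d log pi(a) / d logit_k = 1[k == a] - pi(k).
    """
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for tok in tokens:
        for k in range(len(logits)):
            indicator = 1.0 if k == tok else 0.0
            grad[k] += reward * (indicator - probs[k])
    return grad


def train(steps=500, vocab=4, length=5, lr=0.5, seed=0):
    """Gradient ascent on the expected evaluator score of sampled sequences."""
    rng = random.Random(seed)
    logits = [0.0] * vocab
    for _ in range(steps):
        tokens = sample_sequence(logits, length, rng)
        reward = toy_evaluator(tokens)
        grad = reinforce_gradient(logits, tokens, reward)
        logits = [w + lr * g for w, g in zip(logits, grad)]
    return softmax(logits)


if __name__ == "__main__":
    print(train())
```

Sampling discrete tokens blocks ordinary backpropagation from the evaluator to the generator, which is why the update weights the log-probability gradient of the sampled tokens by the evaluator's score. In this toy setting the policy should shift its probability mass toward the token the evaluator prefers.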

Cite

Text

Dai et al. "Towards Diverse and Natural Image Descriptions via a Conditional GAN." International Conference on Computer Vision, 2017. doi:10.1109/ICCV.2017.323

Markdown

[Dai et al. "Towards Diverse and Natural Image Descriptions via a Conditional GAN." International Conference on Computer Vision, 2017.](https://mlanthology.org/iccv/2017/dai2017iccv-diverse/) doi:10.1109/ICCV.2017.323

BibTeX

@inproceedings{dai2017iccv-diverse,
  title     = {{Towards Diverse and Natural Image Descriptions via a Conditional GAN}},
  author    = {Dai, Bo and Fidler, Sanja and Urtasun, Raquel and Lin, Dahua},
  booktitle = {International Conference on Computer Vision},
  year      = {2017},
  doi       = {10.1109/ICCV.2017.323},
  url       = {https://mlanthology.org/iccv/2017/dai2017iccv-diverse/}
}