Diverse Beam Search for Improved Description of Complex Scenes
Abstract
A single image captures the appearance and position of multiple entities in a scene as well as their complex interactions. As a consequence, natural language grounded in visual contexts tends to be diverse, with utterances differing as focus shifts to specific objects, interactions, or levels of detail. Recently, neural sequence models such as RNNs and LSTMs have been employed to produce visually-grounded language. Beam Search (BS), the standard workhorse for decoding sequences from these models, is an approximate inference algorithm that decodes the top-B sequences in a greedy left-to-right fashion. In practice, the resulting sequences are often minor rewordings of a common utterance, failing to capture the multimodal nature of source images. To address this shortcoming, we propose Diverse Beam Search (DBS), a diversity-promoting alternative to BS for approximate inference. DBS produces sequences that are significantly different from each other by incorporating diversity constraints within groups of candidate sequences during decoding; moreover, it achieves this with minimal computational or memory overhead. We demonstrate that our method improves both the diversity and the quality of decoded sequences over existing techniques on two visually-grounded language generation tasks, image captioning and visual question generation, particularly on complex scenes containing diverse visual content. We also show similar improvements on language-only machine translation tasks, highlighting the generality of our approach.
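The group-wise decoding idea described above can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the diversity term, so the sketch assumes a Hamming-style penalty that discourages a group from choosing tokens already selected at the same time step by earlier groups. The function names (`diverse_beam_search`, `toy_log_probs`), the parameter names, and the three-token toy model are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def toy_log_probs(prefix):
    """Hypothetical stand-in for a sequence model: a fixed 3-token distribution."""
    return [math.log(p) for p in (0.5, 0.3, 0.2)]


def diverse_beam_search(log_prob_fn, vocab_size, num_steps,
                        beam_width, num_groups, diversity_strength):
    """Toy group-wise beam search with an assumed Hamming-style penalty."""
    assert beam_width % num_groups == 0
    per_group = beam_width // num_groups
    # Each group holds (sequence, cumulative log-probability) pairs.
    groups = [[((), 0.0)] for _ in range(num_groups)]
    for _ in range(num_steps):
        used = Counter()  # tokens chosen at this step by earlier groups
        for g in range(num_groups):
            candidates = []
            for seq, score in groups[g]:
                log_probs = log_prob_fn(seq)
                for tok in range(vocab_size):
                    new_score = score + log_probs[tok]
                    # Assumption: the penalty only affects ranking in this
                    # sketch; the stored score stays the model log-probability.
                    ranked = new_score - diversity_strength * used[tok]
                    candidates.append((ranked, new_score, seq + (tok,)))
            candidates.sort(key=lambda c: c[0], reverse=True)
            groups[g] = [(seq, s) for _, s, seq in candidates[:per_group]]
            for seq, _ in groups[g]:
                used[seq[-1]] += 1
    return [seq for group in groups for seq, _ in group]


# With no penalty, both groups decode the same sequence.
no_penalty = diverse_beam_search(toy_log_probs, 3, 2, 2, 2, 0.0)
# With a strong penalty, the second group is pushed to different tokens.
with_penalty = diverse_beam_search(toy_log_probs, 3, 2, 2, 2, 10.0)
print(no_penalty, with_penalty)
```

Setting `num_groups=1` reduces the sketch to ordinary beam search, which reflects the claim that the method adds little overhead: each group runs a standard beam step plus a per-token lookup of what earlier groups chose.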
Cite
Text
Vijayakumar et al. "Diverse Beam Search for Improved Description of Complex Scenes." AAAI Conference on Artificial Intelligence, 2018. doi:10.1609/AAAI.V32I1.12340
Markdown
[Vijayakumar et al. "Diverse Beam Search for Improved Description of Complex Scenes." AAAI Conference on Artificial Intelligence, 2018.](https://mlanthology.org/aaai/2018/vijayakumar2018aaai-diverse/) doi:10.1609/AAAI.V32I1.12340
BibTeX
@inproceedings{vijayakumar2018aaai-diverse,
title = {{Diverse Beam Search for Improved Description of Complex Scenes}},
author = {Vijayakumar, Ashwin K. and Cogswell, Michael and Selvaraju, Ramprasaath R. and Sun, Qing and Lee, Stefan and Crandall, David J. and Batra, Dhruv},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2018},
pages = {7371-7379},
doi = {10.1609/AAAI.V32I1.12340},
url = {https://mlanthology.org/aaai/2018/vijayakumar2018aaai-diverse/}
}