Captioning Images Taken by People Who Are Blind
Abstract
While an important problem in the vision community is to design algorithms that can automatically caption images, few publicly available datasets for algorithm development directly address the interests of real users. Observing that people who are blind have relied on (human-based) image captioning services for nearly a decade to learn about the images they take, we introduce the first image captioning dataset to represent this real use case. This new dataset, which we call VizWiz-Captions, consists of over 39,000 images originating from people who are blind, each paired with five captions. We analyze this dataset to (1) characterize the typical captions, (2) characterize the diversity of content found in the images, and (3) compare its content to that found in eight popular vision datasets. We also analyze modern image captioning algorithms to identify what makes this new dataset challenging for the vision community. We publicly share the dataset with captioning challenge instructions at https://vizwiz.org.
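Because the dataset pairs each image with five crowdsourced captions, a short script makes its structure concrete. The sketch below assumes the annotations follow a COCO-style caption format (an "images" list and an "annotations" list linked by image_id) and a file name of annotations/train.json; both are assumptions to verify against the actual download from https://vizwiz.org.

import json
from collections import defaultdict

# Load the VizWiz-Captions annotations. The file name and the COCO-style
# schema used below are assumptions; check the files distributed at
# https://vizwiz.org.
with open("annotations/train.json") as f:
    data = json.load(f)

# Group the five crowdsourced captions under their image.
captions_by_image = defaultdict(list)
for ann in data["annotations"]:
    captions_by_image[ann["image_id"]].append(ann["caption"])

# Print the first few images with their captions.
for img in data["images"][:3]:
    print(img["file_name"])
    for caption in captions_by_image[img["id"]]:
        print("  -", caption)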
Cite
Text
Gurari et al. "Captioning Images Taken by People Who Are Blind." Proceedings of the European Conference on Computer Vision (ECCV), 2020. doi:10.1007/978-3-030-58520-4_25
Markdown
[Gurari et al. "Captioning Images Taken by People Who Are Blind." Proceedings of the European Conference on Computer Vision (ECCV), 2020.](https://mlanthology.org/eccv/2020/gurari2020eccv-captioning/) doi:10.1007/978-3-030-58520-4_25
BibTeX
@inproceedings{gurari2020eccv-captioning,
  title = {{Captioning Images Taken by People Who Are Blind}},
  author = {Gurari, Danna and Zhao, Yinan and Zhang, Meng and Bhattacharya, Nilavra},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year = {2020},
  doi = {10.1007/978-3-030-58520-4_25},
  url = {https://mlanthology.org/eccv/2020/gurari2020eccv-captioning/}
}