Unsupervised Textual Grounding: Linking Words to Image Concepts

Abstract

Textual grounding, i.e., linking words to objects in images, is a challenging but important task for robotics and human-computer interaction. Existing techniques benefit from recent progress in deep learning and generally formulate the task as a supervised learning problem, selecting a bounding box from a set of possible options. Training these deep-net-based approaches requires access to large-scale datasets; however, constructing such datasets is time-consuming and expensive. Therefore, we develop a completely unsupervised mechanism for textual grounding that uses hypothesis testing to link words to detected image concepts. We demonstrate our approach on the ReferIt Game dataset and the Flickr30k data, outperforming baselines by 7.98% and 6.96%, respectively.

Cite

Text

Yeh et al. "Unsupervised Textual Grounding: Linking Words to Image Concepts." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018. doi:10.1109/CVPR.2018.00641

Markdown

[Yeh et al. "Unsupervised Textual Grounding: Linking Words to Image Concepts." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.](https://mlanthology.org/cvpr/2018/yeh2018cvpr-unsupervised/) doi:10.1109/CVPR.2018.00641

BibTeX

@inproceedings{yeh2018cvpr-unsupervised,
  title     = {{Unsupervised Textual Grounding: Linking Words to Image Concepts}},
  author    = {Yeh, Raymond A. and Do, Minh N. and Schwing, Alexander G.},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2018},
  doi       = {10.1109/CVPR.2018.00641},
  url       = {https://mlanthology.org/cvpr/2018/yeh2018cvpr-unsupervised/}
}