TOUCHDOWN: Natural Language Navigation and Spatial Reasoning in Visual Street Environments
Abstract
We study the problem of jointly reasoning about language and vision through a navigation and spatial reasoning task. We introduce the Touchdown task and dataset, where an agent must first follow navigation instructions in a Street View environment to a goal position, and then guess a location in its observed environment described in natural language to find a hidden object. The data contains 9326 examples of English instructions and spatial descriptions paired with demonstrations. We perform qualitative linguistic analysis, and show that the data displays a rich use of spatial reasoning. Empirical analysis shows the data presents an open challenge to existing methods.
Cite
Text
Chen et al. "TOUCHDOWN: Natural Language Navigation and Spatial Reasoning in Visual Street Environments." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019. doi:10.1109/CVPR.2019.01282
Markdown
[Chen et al. "TOUCHDOWN: Natural Language Navigation and Spatial Reasoning in Visual Street Environments." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.](https://mlanthology.org/cvpr/2019/chen2019cvpr-touchdown/) doi:10.1109/CVPR.2019.01282
BibTeX
@inproceedings{chen2019cvpr-touchdown,
title = {{TOUCHDOWN: Natural Language Navigation and Spatial Reasoning in Visual Street Environments}},
author = {Chen, Howard and Suhr, Alane and Misra, Dipendra and Snavely, Noah and Artzi, Yoav},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year = {2019},
doi = {10.1109/CVPR.2019.01282},
url = {https://mlanthology.org/cvpr/2019/chen2019cvpr-touchdown/}
}