TOUCHDOWN: Natural Language Navigation and Spatial Reasoning in Visual Street Environments

Abstract

We study the problem of jointly reasoning about language and vision through a navigation and spatial reasoning task. We introduce the Touchdown task and dataset, where an agent must first follow navigation instructions in a Street View environment to a goal position, and then guess a location in its observed environment described in natural language to find a hidden object. The data contains 9,326 examples of English instructions and spatial descriptions paired with demonstrations. We perform qualitative linguistic analysis and show that the data displays a rich use of spatial reasoning. Empirical analysis shows the data presents an open challenge to existing methods.

Cite

Text

Chen et al. "TOUCHDOWN: Natural Language Navigation and Spatial Reasoning in Visual Street Environments." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019. doi:10.1109/CVPR.2019.01282

Markdown

[Chen et al. "TOUCHDOWN: Natural Language Navigation and Spatial Reasoning in Visual Street Environments." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.](https://mlanthology.org/cvpr/2019/chen2019cvpr-touchdown/) doi:10.1109/CVPR.2019.01282

BibTeX

@inproceedings{chen2019cvpr-touchdown,
  title     = {{TOUCHDOWN: Natural Language Navigation and Spatial Reasoning in Visual Street Environments}},
  author    = {Chen, Howard and Suhr, Alane and Misra, Dipendra and Snavely, Noah and Artzi, Yoav},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2019},
  doi       = {10.1109/CVPR.2019.01282},
  url       = {https://mlanthology.org/cvpr/2019/chen2019cvpr-touchdown/}
}