Topological Planning with Transformers for Vision-and-Language Navigation

Abstract

Conventional approaches to vision-and-language navigation (VLN) are trained end-to-end but struggle to perform well in freely traversable environments. Inspired by the robotics community, we propose a modular approach to VLN using topological maps. Given a natural language instruction and a topological map, our approach leverages attention mechanisms to predict a navigation plan in the map. The plan is then executed with low-level actions (e.g., forward, rotate) using a robust controller. Experiments show that our method outperforms previous end-to-end approaches, generates interpretable navigation plans, and exhibits intelligent behaviors such as backtracking.

Cite

Text

Chen et al. "Topological Planning with Transformers for Vision-and-Language Navigation." Conference on Computer Vision and Pattern Recognition, 2021. doi:10.1109/CVPR46437.2021.01112

Markdown

[Chen et al. "Topological Planning with Transformers for Vision-and-Language Navigation." Conference on Computer Vision and Pattern Recognition, 2021.](https://mlanthology.org/cvpr/2021/chen2021cvpr-topological/) doi:10.1109/CVPR46437.2021.01112

BibTeX

@inproceedings{chen2021cvpr-topological,
  title     = {{Topological Planning with Transformers for Vision-and-Language Navigation}},
  author    = {Chen, Kevin and Chen, Junshen K. and Chuang, Jo and Vazquez, Marynel and Savarese, Silvio},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2021},
  pages     = {11276--11286},
  doi       = {10.1109/CVPR46437.2021.01112},
  url       = {https://mlanthology.org/cvpr/2021/chen2021cvpr-topological/}
}