Topological Planning with Transformers for Vision-and-Language Navigation
Abstract
Conventional approaches to vision-and-language navigation (VLN) are trained end-to-end but struggle to perform well in freely traversable environments. Inspired by the robotics community, we propose a modular approach to VLN using topological maps. Given a natural language instruction and topological map, our approach leverages attention mechanisms to predict a navigation plan in the map. The plan is then executed with low-level actions (e.g. forward, rotate) using a robust controller. Experiments show that our method outperforms previous end-to-end approaches, generates interpretable navigation plans, and exhibits intelligent behaviors such as backtracking.
Cite
Text
Chen et al. "Topological Planning with Transformers for Vision-and-Language Navigation." Conference on Computer Vision and Pattern Recognition, 2021. doi:10.1109/CVPR46437.2021.01112
Markdown
[Chen et al. "Topological Planning with Transformers for Vision-and-Language Navigation." Conference on Computer Vision and Pattern Recognition, 2021.](https://mlanthology.org/cvpr/2021/chen2021cvpr-topological/) doi:10.1109/CVPR46437.2021.01112
BibTeX
@inproceedings{chen2021cvpr-topological,
title = {{Topological Planning with Transformers for Vision-and-Language Navigation}},
author = {Chen, Kevin and Chen, Junshen K. and Chuang, Jo and Vazquez, Marynel and Savarese, Silvio},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2021},
pages = {11276--11286},
doi = {10.1109/CVPR46437.2021.01112},
url = {https://mlanthology.org/cvpr/2021/chen2021cvpr-topological/}
}