Improving Vision-and-Language Navigation with Image-Text Pairs from the Web

Abstract

Following a navigation instruction such as 'Walk down the stairs and stop at the brown sofa' requires embodied AI agents to ground the scene elements referenced (e.g. 'stairs') to visual content in the environment (pixels corresponding to 'stairs'). We ask the following question: can we leverage abundant 'disembodied' web-scraped vision-and-language corpora (e.g. Conceptual Captions) to learn the visual groundings that improve performance on a relatively data-starved embodied perception task (Vision-and-Language Navigation)? Specifically, we develop VLN-BERT, a visiolinguistic transformer-based model for scoring the compatibility between an instruction ('...stop at the brown sofa') and a trajectory of panoramic RGB images captured by the agent. We demonstrate that pretraining VLN-BERT on image-text pairs from the web before fine-tuning on embodied path-instruction data significantly improves performance on VLN, outperforming prior state-of-the-art in the fully-observed setting by 4 absolute percentage points on success rate. Ablations of our pretraining curriculum show each stage to be impactful, with their combination resulting in further gains.
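
As a rough illustration of how a compatibility scorer can be used, the sketch below ranks a set of candidate trajectories against an instruction and keeps the highest-scoring one. It is not the VLN-BERT model: `toy_compatibility` is a hypothetical stand-in that counts word overlap with made-up object labels, and `select_path`, the candidate data, and the label format are all assumptions introduced for illustration only.

```python
from typing import Callable, List, Sequence, Tuple

# A trajectory is a sequence of panoramas; each panorama is represented here
# by a list of object labels (an assumption standing in for RGB image features).
Trajectory = Sequence[Sequence[str]]


def toy_compatibility(instruction: str, trajectory: Trajectory) -> float:
    """Hypothetical scorer: fraction of instruction words seen along the path."""
    words = {w.strip(".,'").lower() for w in instruction.split()}
    seen = {label.lower() for panorama in trajectory for label in panorama}
    return len(words & seen) / max(len(words), 1)


def select_path(
    instruction: str,
    candidates: List[Trajectory],
    score_fn: Callable[[str, Trajectory], float],
) -> Tuple[int, List[float]]:
    """Score every candidate trajectory and return the index of the best one."""
    scores = [score_fn(instruction, t) for t in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return best, scores


instruction = "Walk down the stairs and stop at the brown sofa"
candidates = [
    [["hallway"], ["kitchen", "table"]],  # never passes stairs or a sofa
    [["stairs"], ["sofa", "lamp"]],       # matches both referenced elements
]
best_index, scores = select_path(instruction, candidates, toy_compatibility)
print(best_index, scores)
```

In a learned system the scoring function would be a trained model over the instruction and the image sequence; only the ranking step shown in `select_path` carries over from this toy version.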

Cite

Text

Majumdar et al. "Improving Vision-and-Language Navigation with Image-Text Pairs from the Web." Proceedings of the European Conference on Computer Vision (ECCV), 2020. doi:10.1007/978-3-030-58539-6_16

Markdown

[Majumdar et al. "Improving Vision-and-Language Navigation with Image-Text Pairs from the Web." Proceedings of the European Conference on Computer Vision (ECCV), 2020.](https://mlanthology.org/eccv/2020/majumdar2020eccv-improving/) doi:10.1007/978-3-030-58539-6_16

BibTeX

@inproceedings{majumdar2020eccv-improving,
  title     = {{Improving Vision-and-Language Navigation with Image-Text Pairs from the Web}},
  author    = {Majumdar, Arjun and Shrivastava, Ayush and Lee, Stefan and Anderson, Peter and Parikh, Devi and Batra, Dhruv},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2020},
  doi       = {10.1007/978-3-030-58539-6_16},
  url       = {https://mlanthology.org/eccv/2020/majumdar2020eccv-improving/}
}