Grounded Entity-Landmark Adaptive Pre-Training for Vision-and-Language Navigation

Abstract

Cross-modal alignment is a key challenge in Vision-and-Language Navigation (VLN). Most existing studies concentrate on mapping the global instruction or a single sub-instruction to the corresponding trajectory. However, the critical problem of achieving fine-grained alignment at the entity level is seldom considered. To address this problem, we propose a novel Grounded Entity-Landmark Adaptive (GELA) pre-training paradigm for VLN tasks. To enable this adaptive pre-training paradigm, we first introduce grounded entity-landmark human annotations into the Room-to-Room (R2R) dataset, yielding GEL-R2R. We then adopt three grounded entity-landmark adaptive pre-training objectives: 1) entity phrase prediction, 2) landmark bounding box prediction, and 3) entity-landmark semantic alignment, which explicitly supervise the learning of fine-grained cross-modal alignment between entity phrases and environment landmarks. Finally, we validate our model on two downstream benchmarks: VLN with descriptive instructions (R2R) and VLN with dialogue instructions (CVDN). Comprehensive experiments show that our GELA model achieves state-of-the-art results on both tasks, demonstrating its effectiveness and generalizability.
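
The third objective, entity-landmark semantic alignment, can be read as a contrastive matching loss between entity-phrase embeddings and landmark-region embeddings. The sketch below illustrates one plausible form of such a loss in PyTorch; the function name, the symmetric InfoNCE formulation, and the temperature value are assumptions for illustration, not the paper's exact objective.

import torch
import torch.nn.functional as F

def entity_landmark_alignment_loss(entity_emb, landmark_emb, temperature=0.07):
    """Hypothetical contrastive alignment between entity phrases and landmarks.

    entity_emb:   (N, D) pooled embeddings of grounded entity phrases
    landmark_emb: (N, D) embeddings of the paired landmark regions
    Pair i is treated as the positive match; all other pairs are negatives.
    """
    # Project both modalities onto the unit sphere so that the dot
    # product below is a cosine similarity.
    entity_emb = F.normalize(entity_emb, dim=-1)
    landmark_emb = F.normalize(landmark_emb, dim=-1)

    # (N, N) similarity logits, sharpened by the temperature.
    logits = entity_emb @ landmark_emb.t() / temperature
    targets = torch.arange(entity_emb.size(0), device=entity_emb.device)

    # Symmetric InfoNCE: entity-to-landmark and landmark-to-entity.
    loss_e2l = F.cross_entropy(logits, targets)
    loss_l2e = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_e2l + loss_l2e)

Under this reading, the paired phrase and region embeddings are pulled together while mismatched pairs in the batch are pushed apart, which is one standard way to supervise fine-grained cross-modal alignment.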

Cite

Text

Cui et al. "Grounded Entity-Landmark Adaptive Pre-Training for Vision-and-Language Navigation." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.01106

Markdown

[Cui et al. "Grounded Entity-Landmark Adaptive Pre-Training for Vision-and-Language Navigation." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/cui2023iccv-grounded/) doi:10.1109/ICCV51070.2023.01106

BibTeX

@inproceedings{cui2023iccv-grounded,
  title     = {{Grounded Entity-Landmark Adaptive Pre-Training for Vision-and-Language Navigation}},
  author    = {Cui, Yibo and Xie, Liang and Zhang, Yakun and Zhang, Meishan and Yan, Ye and Yin, Erwei},
  booktitle = {International Conference on Computer Vision},
  year      = {2023},
  pages     = {12043--12053},
  doi       = {10.1109/ICCV51070.2023.01106},
  url       = {https://mlanthology.org/iccv/2023/cui2023iccv-grounded/}
}