Reinforced Structured State-Evolution for Vision-Language Navigation

Abstract

The Vision-and-Language Navigation (VLN) task requires an embodied agent to navigate to a remote location by following a natural language instruction. Previous methods usually adopt a sequence model (e.g., Transformer or LSTM) as the navigator. In this paradigm, the sequence model predicts an action at each step from a maintained navigation state, which is generally represented as a one-dimensional vector. However, crucial navigation clues (i.e., the object-level environment layout) for embodied navigation are discarded, since the maintained vector is essentially unstructured. In this paper, we propose a novel Structured state-Evolution (SEvol) model to effectively maintain the environment layout clues for VLN. Specifically, we use a graph-based feature, instead of a vector-based one, to represent the navigation state. Accordingly, we devise a Reinforced Layout clues Miner (RLM) to mine and detect the most crucial layout graph for long-term navigation via a customised reinforcement learning strategy. Moreover, a Structured Evolving Module (SEM) is proposed to maintain the structured graph-based state during navigation, where the state gradually evolves to learn object-level spatial-temporal relationships. Experiments on the R2R and R4R datasets show that the proposed SEvol model improves VLN models' performance by large margins, e.g., +3% absolute SPL for NvEM and +8% for EnvDrop on the R2R test set.
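
The SPL figures quoted in the abstract refer to Success weighted by Path Length, a common VLN metric that rewards reaching the goal by a short route. The following is a minimal sketch, assuming the usual definition (the mean over episodes of success times shortest-path length divided by the larger of the taken and shortest path lengths). The function name `spl` and the episode values are hypothetical illustrations, not the paper's evaluation code or results.

```python
from typing import Sequence


def spl(successes: Sequence[bool],
        shortest_lengths: Sequence[float],
        path_lengths: Sequence[float]) -> float:
    """Success weighted by Path Length, averaged over episodes.

    Assumes every shortest-path length is strictly positive.
    """
    if not (len(successes) == len(shortest_lengths) == len(path_lengths)):
        raise ValueError("all inputs must have the same length")
    if not successes:
        raise ValueError("at least one episode is required")
    total = 0.0
    for ok, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        if shortest <= 0:
            raise ValueError("shortest-path lengths must be positive")
        if ok:
            # A successful episode scores shortest / max(taken, shortest),
            # so detours lower the score and it never exceeds 1.
            total += shortest / max(taken, shortest)
        # A failed episode contributes 0 regardless of path length.
    return total / len(successes)


# Hypothetical episodes (lengths in metres), not results from the paper:
# a direct success, a success with a 2x detour, and a failure.
score = spl([True, True, False], [10.0, 10.0, 10.0], [10.0, 20.0, 12.0])
print(f"SPL = {score:.2f}")
```

Under this definition, a "+3% absolute" improvement means the averaged score rises by 0.03, for example from 0.60 to 0.63.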

Cite

Text

Chen et al. "Reinforced Structured State-Evolution for Vision-Language Navigation." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.01501

Markdown

[Chen et al. "Reinforced Structured State-Evolution for Vision-Language Navigation." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/chen2022cvpr-reinforced/) doi:10.1109/CVPR52688.2022.01501

BibTeX

@inproceedings{chen2022cvpr-reinforced,
  title     = {{Reinforced Structured State-Evolution for Vision-Language Navigation}},
  author    = {Chen, Jinyu and Gao, Chen and Meng, Erli and Zhang, Qiong and Liu, Si},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2022},
  pages     = {15450--15459},
  doi       = {10.1109/CVPR52688.2022.01501},
  url       = {https://mlanthology.org/cvpr/2022/chen2022cvpr-reinforced/}
}