Vision-and-Language Navigation via Causal Learning

Abstract

In the pursuit of robust and generalizable environment perception and language understanding, the ubiquitous challenge of dataset bias continues to plague vision-and-language navigation (VLN) agents, hindering their performance in unseen environments. This paper introduces the generalized cross-modal causal transformer (GOAT), a pioneering solution rooted in the paradigm of causal inference. By delving into both observable and unobservable confounders within vision, language, and history, we propose the back-door and front-door adjustment causal learning (BACL and FACL) modules to promote unbiased learning by comprehensively mitigating potential spurious correlations. Additionally, to capture global confounder features, we propose a cross-modal feature pooling (CFP) module supervised by contrastive learning, which is also shown to be effective in improving cross-modal representations during pre-training. Extensive experiments across multiple VLN datasets (R2R, REVERIE, RxR, and SOON) underscore the superiority of our proposed method over previous state-of-the-art approaches. Code is available at https://github.com/CrystalSixone/VLN-GOAT.

Cite

Text

Wang et al. "Vision-and-Language Navigation via Causal Learning." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01248

Markdown

[Wang et al. "Vision-and-Language Navigation via Causal Learning." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/wang2024cvpr-visionandlanguage/) doi:10.1109/CVPR52733.2024.01248

BibTeX

@inproceedings{wang2024cvpr-visionandlanguage,
  title     = {{Vision-and-Language Navigation via Causal Learning}},
  author    = {Wang, Liuyi and He, Zongtao and Dang, Ronghao and Shen, Mengjiao and Liu, Chengju and Chen, Qijun},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {13139-13150},
  doi       = {10.1109/CVPR52733.2024.01248},
  url       = {https://mlanthology.org/cvpr/2024/wang2024cvpr-visionandlanguage/}
}