Do Visual Imaginations Improve Vision-and-Language Navigation Agents?

Abstract

Vision-and-Language Navigation (VLN) agents are tasked with navigating an unseen environment using natural language instructions. In this work, we study whether visual representations of sub-goals implied by the instructions can serve as navigational cues and improve navigation performance. To synthesize these visual representations, or "imaginations", we apply a text-to-image diffusion model to landmark references contained in segmented instructions. These imaginations are provided to VLN agents as an added modality to act as landmark cues, and an auxiliary loss explicitly encourages relating them to their corresponding referring expressions. Our findings reveal an increase in success rate (SR) of ~1 point and up to ~0.5 points in success weighted by inverse path length (SPL) across agents. These results suggest that the proposed approach reinforces visual understanding relative to relying on language instructions alone.
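
The abstract describes three pieces: synthesizing imaginations with a text-to-image diffusion model, feeding them to the agent as an extra modality, and an auxiliary loss tying each imagination to its referring expression. Below is a minimal sketch of the first and third pieces, assuming Stable Diffusion (via diffusers) as the generator and CLIP (via transformers) with a symmetric InfoNCE objective as the alignment loss; the model names, landmark phrases, temperature, and loss form are illustrative assumptions, not the paper's confirmed choices.

import torch
import torch.nn.functional as F
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Landmark references, as if extracted from segmented instructions.
# These phrases are illustrative, not from the paper's data.
landmarks = ["a wooden staircase", "a blue couch next to a window"]

# 1) Synthesize one "imagination" per landmark reference.
sd = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)
imaginations = [sd(phrase).images[0] for phrase in landmarks]

# 2) Embed imaginations and their referring expressions in a joint space.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
batch = proc(text=landmarks, images=imaginations, return_tensors="pt", padding=True).to(device)
img_emb = F.normalize(clip.get_image_features(pixel_values=batch.pixel_values), dim=-1)
txt_emb = F.normalize(clip.get_text_features(input_ids=batch.input_ids,
                                             attention_mask=batch.attention_mask), dim=-1)

# 3) Auxiliary loss: each imagination should match its own referring
#    expression (symmetric InfoNCE over the in-batch similarity matrix).
logits = img_emb @ txt_emb.t() / 0.07  # 0.07 is a typical CLIP temperature
targets = torch.arange(len(landmarks), device=device)
aux_loss = 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
print(f"auxiliary alignment loss: {aux_loss.item():.4f}")

In practice the imaginations would likely be generated offline and cached, since diffusion sampling is too slow to run inside the navigation loop; the auxiliary loss would then be added to the agent's main navigation objective during training.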

Cite

Text

Perincherry et al. "Do Visual Imaginations Improve Vision-and-Language Navigation Agents?" Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.00364

Markdown

[Perincherry et al. "Do Visual Imaginations Improve Vision-and-Language Navigation Agents?" Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/perincherry2025cvpr-visual/) doi:10.1109/CVPR52734.2025.00364

BibTeX

@inproceedings{perincherry2025cvpr-visual,
  title     = {{Do Visual Imaginations Improve Vision-and-Language Navigation Agents?}},
  author    = {Perincherry, Akhil and Krantz, Jacob and Lee, Stefan},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {3846--3855},
  doi       = {10.1109/CVPR52734.2025.00364},
  url       = {https://mlanthology.org/cvpr/2025/perincherry2025cvpr-visual/}
}