Learning Fine-Grained Alignment for Aerial Vision-Dialog Navigation

Abstract

Aerial Vision-Dialog Navigation (AVDN) is a new task that requires drones to navigate to a target location based on human-robot dialog history. This paper focuses on the critical fine-grained cross-modal alignment problem in AVDN, which requires the drone to align language entities with visual landmarks in top-down views. To achieve this, we first construct a Fine-Grained AVDN (FG-AVDN) dataset via a semi-automatic annotation pipeline, providing diverse multimodal annotations at the entity-landmark level. Building on this dataset, we propose a novel Fine-grained Entity-Landmark Alignment (FELA) method that learns the cross-modal alignment explicitly. Concretely, FELA first boosts the drone's visual understanding with a precise semantic grid representation, which captures environmental semantics and spatial structure simultaneously. Then, to learn the entity-landmark alignment, we devise cross-modal auxiliary tasks from three perspectives: grounding, captioning, and contrastive learning. Extensive experiments demonstrate that explicit entity-landmark alignment learning benefits AVDN: FELA achieves leading performance, with improvements of 3.2% in success rate (SR) and 4.9% in goal progress (GP) over prior art. Code and dataset will be publicly available.
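
The abstract does not specify how the contrastive auxiliary task is formulated. As a rough illustration only, the sketch below shows a generic symmetric InfoNCE loss over paired entity and landmark embeddings, a common choice for cross-modal contrastive learning. The function names, the temperature value, and the assumption that row i of each input forms a matched pair are all hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(entity_embs, landmark_embs, temperature=0.07):
    """Symmetric InfoNCE loss (hypothetical helper, not the paper's code).

    Assumes entity_embs[i] and landmark_embs[i] are a matched pair and that
    every other pairing in the batch is a negative.
    """
    ents = [l2_normalize(e) for e in entity_embs]
    lands = [l2_normalize(l) for l in landmark_embs]
    n = len(ents)

    # Cosine similarity between every entity and every landmark,
    # divided by the temperature.
    logits = [
        [sum(a * b for a, b in zip(e, l)) / temperature for l in lands]
        for e in ents
    ]

    def cross_entropy_diag(rows):
        # Mean cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / n

    # Transpose to obtain the landmark-to-entity direction.
    cols = [list(c) for c in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(cols))
```

Under these assumptions, the loss is small when each entity embedding is most similar to its own landmark embedding and grows when the pairs are mismatched, which is the behavior an alignment objective of this kind is meant to encourage.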

Cite

Text

Su et al. "Learning Fine-Grained Alignment for Aerial Vision-Dialog Navigation." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I7.32758

Markdown

[Su et al. "Learning Fine-Grained Alignment for Aerial Vision-Dialog Navigation." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/su2025aaai-learning/) doi:10.1609/AAAI.V39I7.32758

BibTeX

@inproceedings{su2025aaai-learning,
  title     = {{Learning Fine-Grained Alignment for Aerial Vision-Dialog Navigation}},
  author    = {Su, Yifei and An, Dong and Chen, Kehan and Yu, Weichen and Ning, Baiyang and Ling, Yonggen and Huang, Yan and Wang, Liang},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {7060--7068},
  doi       = {10.1609/AAAI.V39I7.32758},
  url       = {https://mlanthology.org/aaai/2025/su2025aaai-learning/}
}