Learning Fine-Grained Alignment for Aerial Vision-Dialog Navigation
Abstract
Aerial Vision-Dialog Navigation (AVDN) is a new task that requires drones to navigate to a target location based on human-robot dialog history. This paper focuses on the critical fine-grained cross-modal alignment problem in AVDN, which requires the drone to align language entities with visual landmarks in top-down views. To achieve this, we first construct a Fine-Grained AVDN (FG-AVDN) dataset via a semi-automatic annotation pipeline, providing diverse multimodal annotations at the entity-landmark level. Based on this dataset, we propose a novel Fine-grained Entity-Landmark Alignment (FELA) method that learns the cross-modal alignment explicitly. Concretely, FELA first boosts the drone's visual understanding with a precise semantic grid representation, which captures environmental semantics and spatial structure simultaneously. Subsequently, to learn the entity-landmark alignment, we devise cross-modal auxiliary tasks from three perspectives: grounding, captioning, and contrastive learning. Extensive experiments demonstrate that our explicit entity-landmark alignment learning is beneficial for AVDN. As a result, FELA achieves leading performance with 3.2% SR and 4.9% GP improvements over prior art. Code and dataset will be publicly available.
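As a rough illustration of the contrastive perspective mentioned in the abstract, the sketch below computes a symmetric InfoNCE-style loss between paired entity and landmark embeddings. It is not taken from the paper: the function name `entity_landmark_contrastive_loss`, the embedding shapes, and the `temperature` value are all assumptions made for the example, and FELA's actual objective may differ.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def entity_landmark_contrastive_loss(entity_emb, landmark_emb, temperature=0.07):
    """Symmetric InfoNCE loss (hypothetical helper, not the paper's code).

    entity_emb:   (N, D) array, one row per language entity.
    landmark_emb: (N, D) array, row i is the landmark paired with entity i.
    temperature:  assumed softmax temperature; 0.07 is a common default.
    """
    e = l2_normalize(entity_emb)
    v = l2_normalize(landmark_emb)
    logits = e @ v.T / temperature          # (N, N) similarity matrix
    idx = np.arange(logits.shape[0])        # matched pairs lie on the diagonal
    loss_e2v = -log_softmax(logits, axis=1)[idx, idx].mean()  # entity -> landmark
    loss_v2e = -log_softmax(logits, axis=0)[idx, idx].mean()  # landmark -> entity
    return 0.5 * (loss_e2v + loss_v2e)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    entities = rng.normal(size=(4, 16))
    # Toy landmarks: noisy copies of the entity vectors, so pairs are well aligned.
    landmarks = entities + 0.01 * rng.normal(size=(4, 16))
    print(entity_landmark_contrastive_loss(entities, landmarks))
```

In this toy setup, correctly paired embeddings yield a smaller loss than mismatched ones, which is the behavior a contrastive alignment objective is meant to reward.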
Cite
Text
Su et al. "Learning Fine-Grained Alignment for Aerial Vision-Dialog Navigation." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I7.32758
Markdown
[Su et al. "Learning Fine-Grained Alignment for Aerial Vision-Dialog Navigation." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/su2025aaai-learning/) doi:10.1609/AAAI.V39I7.32758
BibTeX
@inproceedings{su2025aaai-learning,
title = {{Learning Fine-Grained Alignment for Aerial Vision-Dialog Navigation}},
author = {Su, Yifei and An, Dong and Chen, Kehan and Yu, Weichen and Ning, Baiyang and Ling, Yonggen and Huang, Yan and Wang, Liang},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2025},
pages = {7060-7068},
doi = {10.1609/AAAI.V39I7.32758},
url = {https://mlanthology.org/aaai/2025/su2025aaai-learning/}
}