Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching

Abstract

Navigating drones through natural language commands remains challenging due to the dearth of accessible multi-modal datasets and the stringent precision requirements for aligning visual and textual data. To address this pressing need, we introduce GeoText-1652, a new natural language-guided geolocalization benchmark. This dataset is systematically constructed through an interactive human-computer process leveraging Large Language Model (LLM) driven annotation techniques in conjunction with pre-trained vision models. GeoText-1652 extends the established University-1652 image dataset with spatial-aware text annotations, thereby establishing one-to-one correspondences between image, text, and bounding box elements. We further introduce a new optimization objective to leverage fine-grained spatial associations, called blending spatial matching, for region-level spatial relation matching. Extensive experiments reveal that our approach maintains a competitive recall rate compared with other prevailing cross-modality methods. This underscores the promising potential of our approach in elevating drone control and navigation through the seamless integration of natural language commands in real-world scenarios.
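
The abstract reports results as a recall rate for cross-modal retrieval. As a rough illustration of how such a metric is commonly computed, the sketch below implements Recall@K over a text-to-image similarity matrix. The function name `recall_at_k`, the toy scores, and the assumption of exactly one ground-truth gallery image per query are all illustrative choices made here; the exact evaluation protocol used in the paper is not stated in this abstract and may differ.

```python
def recall_at_k(similarity, gt_index, k):
    """Fraction of queries whose ground-truth gallery item ranks in the top k.

    similarity: list of rows, one per query, each a list of scores
                against every gallery item (higher means more similar).
    gt_index:   list giving the ground-truth gallery index for each query
                (assumption: exactly one correct item per query).
    """
    hits = 0
    for scores, gt in zip(similarity, gt_index):
        # Rank gallery indices from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if gt in ranked[:k]:
            hits += 1
    return hits / len(gt_index)


# Toy data (made up for illustration): 3 text queries, 4 gallery images.
sim = [
    [0.9, 0.1, 0.3, 0.2],  # ground truth 0 is ranked first
    [0.2, 0.4, 0.8, 0.1],  # ground truth 1 is ranked second
    [0.5, 0.6, 0.7, 0.1],  # ground truth 3 is ranked last
]
gt = [0, 1, 3]

for k in (1, 2, 4):
    print(f"Recall@{k} = {recall_at_k(sim, gt, k):.3f}")
```

With the toy scores above, one of three queries is matched at rank 1, two of three within the top 2, and all three within the top 4. A benchmark where one query can match several gallery images would need `gt_index` generalized to a set of valid indices per query.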

Cite

Text

Chu et al. "Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-73247-8_13

Markdown

[Chu et al. "Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/chu2024eccv-natural/) doi:10.1007/978-3-031-73247-8_13

BibTeX

@inproceedings{chu2024eccv-natural,
  title     = {{Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching}},
  author    = {Chu, Meng and Zheng, Zhedong and Ji, Wei and Wang, Tingyu and Chua, Tat-Seng},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-73247-8_13},
  url       = {https://mlanthology.org/eccv/2024/chu2024eccv-natural/}
}