Language-Guided Hybrid Representation Learning for Visual Grounding on Remote Sensing Images

Abstract

Visual grounding (VG) refers to detecting specific objects in images based on linguistic expressions, and it plays an important role in the advanced interpretation of natural images. In remote sensing image interpretation, visual grounding is hindered by complex scenes and diverse object sizes. To address this problem, we propose a novel remote sensing visual grounding (RSVG) framework, the language-guided hybrid representation learning Transformer (LGFormer). Specifically, we design a multimodal dual-encoder Transformer structure called the adaptive multimodal feature fusion module. This structure integrates text and visual features as hybrid queries, enabling early-stage decoding queries to perceive the target position accurately. The hybrid queries then aggregate the modal information from the dual encoders to obtain the final object embedding for coordinate regression. In addition, a multi-scale cross-modal feature enhancement module (MSCM) is designed to enhance the self-representation of the extracted text and visual features and to align them semantically. For the hybrid queries, we use linguistic guidance to select visual features as the visual part and use sentence-level features as the textual part. LGFormer achieves the best results among the compared existing models on the DIOR-RSVG and OPT-RSVG datasets.
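The hybrid-query idea in the abstract (language-selected visual features plus a sentence-level text feature) can be illustrated with a minimal sketch. The abstract does not state how the selection is computed, so the cosine-similarity top-k rule below is an assumption, and the names `cosine`, `build_hybrid_queries`, and `k` are hypothetical and not taken from LGFormer.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_hybrid_queries(visual_tokens, sentence_feature, k):
    """Form hybrid queries from visual tokens and one sentence-level feature.

    Assumption: "linguistic guidance" is modeled as ranking visual tokens by
    cosine similarity to the sentence feature and keeping the top k.
    """
    scores = [cosine(token, sentence_feature) for token in visual_tokens]
    ranked = sorted(range(len(visual_tokens)), key=lambda i: scores[i], reverse=True)
    selected = ranked[:k]
    visual_part = [visual_tokens[i] for i in selected]   # language-selected visual features
    textual_part = [list(sentence_feature)]              # sentence-level feature
    return visual_part + textual_part, selected


# Toy 2-D features standing in for encoder outputs.
visual_tokens = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
sentence_feature = [1.0, 0.1]
queries, picked = build_hybrid_queries(visual_tokens, sentence_feature, k=2)
print(picked, len(queries))
```

In this toy setup the two visual tokens most aligned with the sentence feature are kept and the sentence feature is appended as the textual part of the query set. A real model would operate on learned, high-dimensional encoder features and would pass the resulting queries to a decoder for coordinate regression.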

Cite

Text

Liu et al. "Language-Guided Hybrid Representation Learning for Visual Grounding on Remote Sensing Images." International Joint Conference on Artificial Intelligence, 2025. doi:10.24963/IJCAI.2025/174

Markdown

[Liu et al. "Language-Guided Hybrid Representation Learning for Visual Grounding on Remote Sensing Images." International Joint Conference on Artificial Intelligence, 2025.](https://mlanthology.org/ijcai/2025/liu2025ijcai-language/) doi:10.24963/IJCAI.2025/174

BibTeX

@inproceedings{liu2025ijcai-language,
  title     = {{Language-Guided Hybrid Representation Learning for Visual Grounding on Remote Sensing Images}},
  author    = {Liu, Biao and Liu, Xu and Li, Lingling and Jiao, Licheng and Liu, Fang and Sun, Xinyu and Huang, Youlin},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {1557--1566},
  doi       = {10.24963/IJCAI.2025/174},
  url       = {https://mlanthology.org/ijcai/2025/liu2025ijcai-language/}
}