Task-Aware Cross-Modal Feature Refinement Transformer with Large Language Models for Visual Grounding

Abstract

The goal of visual grounding is to establish connections between target objects and textual descriptions. Large Language Models (LLMs) have demonstrated strong comprehension abilities across a variety of visual tasks. To establish precise associations between the text and the corresponding visual region, we propose a Task-aware Cross-modal feature Refinement Transformer with LLMs for visual grounding, referred to as TCRT. To enable an LLM trained solely on text to understand images, we introduce an LLM adaptation module that extracts text-related visual features, bridging the domain discrepancy between the textual and visual modalities. We feed the text and visual features into the LLM to obtain task-aware priors. To enable these priors to guide the feature fusion process, we further incorporate a cross-modal feature fusion module, which allows task-aware embeddings to refine the visual features and facilitates information interaction between the Referring Expression Comprehension (REC) and Referring Expression Segmentation (RES) tasks. Extensive experiments verify the effectiveness of the main components and demonstrate the superior performance of the proposed TCRT over state-of-the-art end-to-end visual grounding methods on RefCOCO, RefCOCOg, RefCOCO+, and ReferItGame.
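The abstract describes a concrete pipeline: visual features are adapted into the LLM's embedding space, the LLM produces task-aware priors, and a cross-modal fusion module uses those priors to refine the visual features for the REC and RES heads. The following is a minimal PyTorch sketch of that flow. All module names, dimensions, and the stand-in frozen "LLM" (a small Transformer encoder here, where the real model would be a pretrained text-only LLM) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class LLMAdaptationModule(nn.Module):
    """Projects visual features into the LLM embedding space so a
    text-only LLM can consume image content (bridging the domain gap)."""
    def __init__(self, vis_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vis_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vis_feats: torch.Tensor) -> torch.Tensor:
        # vis_feats: (B, N_vis, vis_dim) patch features from a visual backbone
        return self.proj(vis_feats)


class CrossModalFusion(nn.Module):
    """Refines visual features with task-aware embeddings via
    cross-attention, letting the LLM priors guide the fusion."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis_feats, task_embeds):
        # Visual tokens attend to the task-aware priors produced by the LLM.
        refined, _ = self.attn(query=vis_feats, key=task_embeds, value=task_embeds)
        return self.norm(vis_feats + refined)


class TCRTSketch(nn.Module):
    """Hypothetical end-to-end sketch of the TCRT pipeline."""
    def __init__(self, vis_dim: int = 256, llm_dim: int = 512):
        super().__init__()
        self.adapter = LLMAdaptationModule(vis_dim, llm_dim)
        # Stand-in for a frozen, text-pretrained LLM; the real model would
        # be a pretrained language model kept frozen during training.
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.back_proj = nn.Linear(llm_dim, vis_dim)
        self.fusion = CrossModalFusion(vis_dim)
        self.box_head = nn.Linear(vis_dim, 4)    # REC: bounding-box regression
        self.mask_head = nn.Linear(vis_dim, 1)   # RES: per-token mask logits

    def forward(self, vis_feats, text_embeds):
        # 1. Adapt visual features into LLM space and concatenate with text.
        llm_in = torch.cat([self.adapter(vis_feats), text_embeds], dim=1)
        # 2. The (frozen) LLM yields task-aware prior embeddings.
        task_embeds = self.back_proj(self.llm(llm_in))
        # 3. Task-aware priors refine the visual features via cross-attention.
        refined = self.fusion(vis_feats, task_embeds)
        pooled = refined.mean(dim=1)
        return self.box_head(pooled), self.mask_head(refined).squeeze(-1)


if __name__ == "__main__":
    model = TCRTSketch()
    vis = torch.randn(2, 196, 256)   # e.g. 14x14 patch features
    txt = torch.randn(2, 16, 512)    # text embeddings in LLM space
    box, mask = model(vis, txt)
    print(box.shape, mask.shape)     # torch.Size([2, 4]) torch.Size([2, 196])
```

Sharing the LLM-derived task embeddings across both heads is one plausible way to realize the REC/RES information interaction the abstract mentions; the paper may implement this exchange differently.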

Cite

Text

Chen et al. "Task-Aware Cross-Modal Feature Refinement Transformer with Large Language Models for Visual Grounding." Conference on Computer Vision and Pattern Recognition, 2025, pp. 3931-3941. doi:10.1109/CVPR52734.2025.00372

Markdown

[Chen et al. "Task-Aware Cross-Modal Feature Refinement Transformer with Large Language Models for Visual Grounding." Conference on Computer Vision and Pattern Recognition, 2025, pp. 3931-3941.](https://mlanthology.org/cvpr/2025/chen2025cvpr-taskaware/) doi:10.1109/CVPR52734.2025.00372

BibTeX

@inproceedings{chen2025cvpr-taskaware,
  title     = {{Task-Aware Cross-Modal Feature Refinement Transformer with Large Language Models for Visual Grounding}},
  author    = {Chen, Wenbo and Xu, Zhen and Xu, Ruotao and Wu, Si and Wong, Hau-San},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {3931--3941},
  doi       = {10.1109/CVPR52734.2025.00372},
  url       = {https://mlanthology.org/cvpr/2025/chen2025cvpr-taskaware/}
}