Seeing the Trees for the Forest: Rethinking Weakly-Supervised Medical Visual Grounding

Huy, Ta Duc; Huynh, Duy Anh; Xie, Yutong; Qi, Yuankai; Chen, Qi; Le Nguyen, Phi; Tran, Sen Kim; Phung, Son Lam; van den Hengel, Anton; Liao, Zhibin; To, Minh-Son; Verjans, Johan W.; Phan, Vu Minh Hieu

ICCV 2025 pp. 24445-24455

Abstract

Visual grounding (VG) is the capability to identify the specific regions in an image associated with a particular text description. In medical imaging, VG enhances interpretability by highlighting the pathological features that correspond to textual descriptions, improving model transparency and trustworthiness and supporting wider adoption of deep learning models in clinical practice. Current models struggle to associate textual descriptions with disease regions due to inefficient attention mechanisms and a lack of fine-grained token representations. In this paper, we empirically demonstrate two key observations. First, current vision-language models (VLMs) assign high norms to background tokens, diverting the model's attention from regions of disease. Second, the global tokens used for cross-modal learning are not representative of local disease tokens, which hampers identifying correlations between the text and disease tokens. To address these issues, we introduce a simple yet effective Disease-Aware Prompting (DAP) process, which uses the explainability map of a VLM to identify the appropriate image features. This strategy amplifies disease-relevant regions while suppressing background interference. Without any additional pixel-level annotations, DAP improves visual grounding accuracy by 20.74% compared to state-of-the-art methods across three major chest X-ray datasets.
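The abstract does not specify how the explainability map is applied to the image features, so the following is only an illustrative sketch of the general idea of amplifying relevant patch tokens and suppressing background ones. It is not the authors' implementation. The function names, the min-max normalisation, and the weighted pooling are all assumptions made for illustration.

```python
import numpy as np

def disease_aware_reweight(patch_tokens, explain_map, eps=1e-8):
    """Scale each patch token by a min-max normalised relevance score.

    patch_tokens: (N, D) array of patch embeddings (hypothetical input).
    explain_map:  (N,) array of per-patch relevance scores, assumed to come
                  from some explainability method applied to a VLM.
    Returns the reweighted tokens and the weights in [0, 1].
    """
    patch_tokens = np.asarray(patch_tokens, dtype=float)
    explain_map = np.asarray(explain_map, dtype=float)
    lo, hi = explain_map.min(), explain_map.max()
    # Assumption: min-max normalisation; a constant map yields all-zero weights.
    weights = (explain_map - lo) / (hi - lo + eps)
    return patch_tokens * weights[:, None], weights

def pooled_feature(weighted_tokens, weights, eps=1e-8):
    """Relevance-weighted mean of the tokens, a stand-in for a global token."""
    return weighted_tokens.sum(axis=0) / (weights.sum() + eps)

# Toy example: four patches, the last two marked as relevant.
tokens = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]])
relevance = np.array([0.0, 0.0, 1.0, 1.0])
weighted, w = disease_aware_reweight(tokens, relevance)
pooled = pooled_feature(weighted, w)
```

In this toy case the two background patches are zeroed out, so the pooled feature is driven only by the patches flagged as relevant rather than by an unweighted average over the whole image.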

Cite

Text

Huy et al. "Seeing the Trees for the Forest: Rethinking Weakly-Supervised Medical Visual Grounding." International Conference on Computer Vision, 2025.

Markdown

[Huy et al. "Seeing the Trees for the Forest: Rethinking Weakly-Supervised Medical Visual Grounding." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/huy2025iccv-seeing/)

BibTeX

@inproceedings{huy2025iccv-seeing,
  title     = {{Seeing the Trees for the Forest: Rethinking Weakly-Supervised Medical Visual Grounding}},
  author    = {Huy, Ta Duc and Huynh, Duy Anh and Xie, Yutong and Qi, Yuankai and Chen, Qi and Le Nguyen, Phi and Tran, Sen Kim and Phung, Son Lam and van den Hengel, Anton and Liao, Zhibin and To, Minh-Son and Verjans, Johan W. and Phan, Vu Minh Hieu},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {24445--24455},
  url       = {https://mlanthology.org/iccv/2025/huy2025iccv-seeing/}
}