Interpreting Object-Level Foundation Models via Visual Precision Search

Abstract

Advances in multimodal pre-training have propelled object-level foundation models, such as Grounding DINO and Florence-2, in tasks like visual grounding and object detection. However, interpreting these models' decisions has grown increasingly challenging. Existing interpretable attribution methods for object-level tasks have notable limitations: (1) gradient-based methods lack precise localization due to the visual-textual fusion in foundation models, and (2) perturbation-based methods produce noisy saliency maps, limiting fine-grained interpretability. To address these issues, we propose a Visual Precision Search method that generates accurate attribution maps with fewer regions. Our method bypasses internal model parameters to overcome attribution issues arising from multimodal fusion, dividing inputs into sparse sub-regions and using consistency and collaboration scores to accurately identify critical decision-making regions. We also provide a theoretical analysis of the boundary guarantees and applicability scope of our method. Experiments on RefCOCO, MS COCO, and LVIS show that our approach improves object-level task interpretability over SOTA methods for Grounding DINO and Florence-2 across various evaluation metrics, with faithfulness gains of 23.7%, 31.6%, and 20.1% on MS COCO, LVIS, and RefCOCO for Grounding DINO, and 50.7% and 66.9% on MS COCO and RefCOCO for Florence-2. Moreover, our method can interpret failures in visual grounding and object detection tasks, surpassing existing methods across multiple evaluation metrics. The code is released at https://github.com/RuoyuChen10/VPS.
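
The abstract describes the search only at a high level (sparse sub-regions scored by consistency and collaboration). Below is a minimal, hypothetical sketch of what such a greedy sub-region search could look like; the `model_confidence` stub, the grid partition, and the simplified consistency/collaboration scores are illustrative assumptions and do not reproduce the paper's exact formulation (see the released code at https://github.com/RuoyuChen10/VPS for the actual implementation).

```python
import numpy as np

def model_confidence(image: np.ndarray, mask: np.ndarray) -> float:
    """Placeholder scorer. In practice this would query an object-level
    foundation model (e.g., Grounding DINO) on the image with only the
    mask-selected pixels visible and return the target object's confidence."""
    return float(mask.mean())  # stand-in: confidence grows with visible area

def grid_regions(h: int, w: int, grid: int = 4):
    """Partition an h x w image into grid x grid rectangular sub-regions."""
    regions, hs, ws = [], h // grid, w // grid
    for i in range(grid):
        for j in range(grid):
            m = np.zeros((h, w), dtype=bool)
            m[i * hs:(i + 1) * hs, j * ws:(j + 1) * ws] = True
            regions.append(m)
    return regions

def visual_precision_search(image, regions, steps=8, lam=0.5):
    """Greedy search over sub-regions (simplified sketch).

    At each step, add the region maximising a combined score of
    `consistency` (confidence with the selected regions visible) and
    `collaboration` (confidence drop when the region is removed from the
    full image). Both scores here are simplified placeholders."""
    h, w = image.shape[:2]
    selected = np.zeros((h, w), dtype=bool)
    full = np.ones((h, w), dtype=bool)
    remaining, order = list(range(len(regions))), []
    for _ in range(min(steps, len(remaining))):
        best_idx, best_score = None, -np.inf
        for idx in remaining:
            consistency = model_confidence(image, selected | regions[idx])
            collaboration = model_confidence(image, full) - \
                model_confidence(image, full & ~regions[idx])
            score = consistency + lam * collaboration
            if score > best_score:
                best_idx, best_score = idx, score
        selected |= regions[best_idx]
        remaining.remove(best_idx)
        order.append(best_idx)
    return order, selected  # insertion order defines the attribution ranking

if __name__ == "__main__":
    img = np.random.rand(64, 64, 3)
    order, saliency = visual_precision_search(img, grid_regions(64, 64))
    print("Region insertion order:", order)
```

The insertion order of regions can then be turned into a sparse saliency map, which is the sense in which the method attributes a decision to "fewer regions".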

Cite

Text

Chen et al. "Interpreting Object-Level Foundation Models via Visual Precision Search." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.02796

Markdown

[Chen et al. "Interpreting Object-Level Foundation Models via Visual Precision Search." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/chen2025cvpr-interpreting/) doi:10.1109/CVPR52734.2025.02796

BibTeX

@inproceedings{chen2025cvpr-interpreting,
  title     = {{Interpreting Object-Level Foundation Models via Visual Precision Search}},
  author    = {Chen, Ruoyu and Liang, Siyuan and Li, Jingzhi and Liu, Shiming and Li, Maosen and Huang, Zhen and Zhang, Hua and Cao, Xiaochun},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {30042--30052},
  doi       = {10.1109/CVPR52734.2025.02796},
  url       = {https://mlanthology.org/cvpr/2025/chen2025cvpr-interpreting/}
}