Explaining Caption-Image Interactions in CLIP Models with Second-Order Attributions

Abstract

Dual encoder architectures like CLIP models map two types of inputs into a shared embedding space and predict similarities between them. Despite their wide application, it is, however, not understood how these models compare their two inputs. Common first-order feature-attribution methods explain the importance of individual features and can thus provide only limited insights into dual encoders, whose predictions depend on interactions between features. In this paper, we first derive a second-order method enabling the attribution of predictions by any differentiable dual encoder onto feature interactions between its inputs. Second, we apply our method to CLIP models and show that they learn fine-grained correspondences between parts of captions and regions in images. They match objects across input modes and also account for mismatches. This intrinsic visual-linguistic grounding ability, however, varies heavily between object classes and exhibits pronounced out-of-domain effects; we can identify individual errors as well as systematic failure categories. Code is publicly available: https://github.com/lucasmllr/exCLIP
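
The abstract describes attributing a dual encoder's similarity score to interactions between features of its two inputs rather than to individual features. As a rough illustration of that general idea only (not the authors' exact formulation, which is available in the linked exCLIP repository), the following PyTorch sketch uses a hypothetical toy dual encoder and computes mixed second derivatives of the similarity score with respect to caption-token and image-patch features, yielding a token-by-patch interaction matrix.

```python
import torch

# Hypothetical toy dual encoder: two small towers map token/patch features
# into a shared embedding space; similarity is the dot product of the
# normalized, mean-pooled embeddings (a stand-in for CLIP's cosine score).
torch.manual_seed(0)
d_in, d_emb = 8, 16
text_tower = torch.nn.Sequential(
    torch.nn.Linear(d_in, d_emb), torch.nn.Tanh(), torch.nn.Linear(d_emb, d_emb)
)
image_tower = torch.nn.Sequential(
    torch.nn.Linear(d_in, d_emb), torch.nn.Tanh(), torch.nn.Linear(d_emb, d_emb)
)

def similarity(text_feats, image_feats):
    t = torch.nn.functional.normalize(text_tower(text_feats).mean(dim=0), dim=0)
    v = torch.nn.functional.normalize(image_tower(image_feats).mean(dim=0), dim=0)
    return t @ v

# Inputs: features for 5 caption tokens and 7 image patches (random here).
text_feats = torch.randn(5, d_in, requires_grad=True)
image_feats = torch.randn(7, d_in, requires_grad=True)

score = similarity(text_feats, image_feats)

# First-order gradients w.r.t. caption tokens; keep the graph for 2nd order.
grad_text = torch.autograd.grad(score, text_feats, create_graph=True)[0]

# Mixed second derivatives: for each caption token i and image patch j, sum
# d^2 score / (d text_i d image_j) over feature dimensions. The resulting
# (n_tokens, n_patches) matrix attributes the score to token-patch
# interactions; in practice such terms are often additionally weighted by the
# inputs (Taylor-style), which this sketch omits for brevity.
interactions = torch.zeros(text_feats.shape[0], image_feats.shape[0])
for i in range(text_feats.shape[0]):
    second = torch.autograd.grad(grad_text[i].sum(), image_feats, retain_graph=True)[0]
    interactions[i] = second.sum(dim=1)

print(interactions.shape)  # torch.Size([5, 7])
```

Each entry of `interactions` pairs one caption token with one image patch, which is the kind of cross-modal correspondence map the paper analyzes; the repository above provides the authors' actual implementation for real CLIP models.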

Cite

Text

Moeller et al. "Explaining Caption-Image Interactions in CLIP Models with Second-Order Attributions." Transactions on Machine Learning Research, 2025.

Markdown

[Moeller et al. "Explaining Caption-Image Interactions in CLIP Models with Second-Order Attributions." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/moeller2025tmlr-explaining/)

BibTeX

@article{moeller2025tmlr-explaining,
  title     = {{Explaining Caption-Image Interactions in CLIP Models with Second-Order Attributions}},
  author    = {Moeller, Lucas and Tilli, Pascal and Vu, Thang and Padó, Sebastian},
  journal   = {Transactions on Machine Learning Research},
  year      = {2025},
  url       = {https://mlanthology.org/tmlr/2025/moeller2025tmlr-explaining/}
}