Measuring Human-CLIP Alignment at Different Abstraction Levels

Abstract

Measuring the human alignment of trained models is gaining traction because it is not clear to what extent artificial image representations are proper models of the visual brain. Using the CLIP model and some of its variants as a case study, we show the importance of probing different abstraction levels: when measuring image distances, the differences between images can lie at lower or higher levels of abstraction. This allows us to draw richer conclusions about the models and reveals interesting phenomena that arise when analyzing them depth-wise, layer by layer. Our analysis identifies the size of the patches into which the image is divided as the most important factor for achieving high human alignment at all abstraction levels. We also find that the method used to compute distances from the model representations is crucial to avoid alignment drops. Moreover, replacing the usual softmax activation with a sigmoid increases human alignment at all abstraction levels, especially in the last model layers. Surprisingly, training the model with Chinese captions or with medical data yields more human-aligned models, but only at low abstraction levels.
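As a rough illustration of the kind of measurement the abstract describes, the sketch below embeds image pairs with a pretrained CLIP variant, computes a distance between the embeddings, and rank-correlates those distances with human dissimilarity ratings. This is a minimal sketch, not the authors' exact protocol: the image files, the human ratings, the cosine distance, and the Spearman correlation are illustrative assumptions, and it uses the open_clip library rather than whatever pipeline the paper used.

```python
# Minimal sketch of human-CLIP alignment measurement (illustrative only).
# Assumes: pip install open_clip_torch scipy pillow, and local image files.

import torch
import open_clip
from PIL import Image
from scipy.stats import spearmanr

# Patch size 32 here; swapping to "ViT-B-16" varies the patch size,
# which the abstract identifies as the most important factor.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
model.eval()

def embed(path):
    """Return the CLIP image embedding for one image file."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return model.encode_image(img).squeeze(0)

# Hypothetical stimuli: image pairs whose differences sit at different
# abstraction levels, with placeholder human dissimilarity ratings.
pairs = [("ref_0.png", "dist_0.png"), ("ref_1.png", "dist_1.png")]
human_ratings = [0.8, 0.2]

model_dists = []
for a, b in pairs:
    ea, eb = embed(a), embed(b)
    # Cosine distance; the abstract notes the distance choice matters.
    sim = torch.nn.functional.cosine_similarity(ea, eb, dim=0)
    model_dists.append(1.0 - sim.item())

# Alignment score: rank correlation between model and human distances.
rho, _ = spearmanr(model_dists, human_ratings)
print(f"Spearman alignment: {rho:.3f}")
```

To probe alignment layer by layer, as the paper does, one would instead extract intermediate activations (e.g., via forward hooks on the transformer blocks) and repeat the same distance-correlation computation at each depth.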

Cite

Text

Hernández-Cámara et al. "Measuring Human-CLIP Alignment at Different Abstraction Levels." ICLR 2024 Workshops: Re-Align, 2024.

Markdown

[Hernández-Cámara et al. "Measuring Human-CLIP Alignment at Different Abstraction Levels." ICLR 2024 Workshops: Re-Align, 2024.](https://mlanthology.org/iclrw/2024/hernandezcamara2024iclrw-measuring/)

BibTeX

@inproceedings{hernandezcamara2024iclrw-measuring,
  title     = {{Measuring Human-CLIP Alignment at Different Abstraction Levels}},
  author    = {Hernández-Cámara, Pablo and Vila-Tomás, Jorge and Malo, Jesus and Laparra, Valero},
  booktitle = {ICLR 2024 Workshops: Re-Align},
  year      = {2024},
  url       = {https://mlanthology.org/iclrw/2024/hernandezcamara2024iclrw-measuring/}
}