The Independent Compositional Subspace Hypothesis for the Structure of CLIP's Last Layer

Abstract

We propose the hypothesis that CLIP disentangles compositional visual attributes into orthogonal, independent subspaces, which it then uses to build compositional representations of images. The hypothesis suggests that CLIP learns compositional techniques similar to those of humans. We identify five core compositional attributes predicted by the hypothesis: color, size, counting, camera view, and pattern. We empirically test the corresponding subspaces and find that each codes for its respective attribute type and that they are essentially orthogonal to one another and to the subject of the image.
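One way to make the orthogonality claim concrete is to compare two attribute subspaces through their principal angles. The sketch below is only illustrative and is not the procedure used in the paper, which the abstract does not describe. It assumes each attribute subspace is estimated by PCA over embeddings that vary in that attribute alone, and it uses synthetic vectors in place of real CLIP embeddings. The helper names `subspace_basis` and `principal_cosines`, the dimension `d`, and the subspace rank `k` are all hypothetical choices.

```python
import numpy as np


def subspace_basis(embeddings, k):
    """Return a (d, k) orthonormal basis for the top-k principal directions.

    `embeddings` has shape (n_samples, d). Hypothetical helper: PCA is an
    assumed way to estimate an attribute subspace, not the paper's method.
    """
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal principal directions, sorted by variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T


def principal_cosines(basis_a, basis_b):
    """Cosines of the principal angles between two subspaces.

    Both inputs must have orthonormal columns. Values near 0 indicate
    near-orthogonal subspaces; values near 1 indicate shared directions.
    """
    return np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)


# Synthetic stand-ins for embeddings (assumption: real inputs would be
# last-layer CLIP embeddings of images that differ in a single attribute).
rng = np.random.default_rng(0)
d, n, k = 16, 200, 3

color = np.zeros((n, d))
color[:, 0:3] = rng.normal(size=(n, 3))  # "color" varies in coords 0..2

size = np.zeros((n, d))
size[:, 3:6] = rng.normal(size=(n, 3))   # "size" varies in coords 3..5

color_basis = subspace_basis(color, k)
size_basis = subspace_basis(size, k)

cos_cross = principal_cosines(color_basis, size_basis)  # between subspaces
cos_self = principal_cosines(color_basis, color_basis)  # sanity check

print("color vs size:", np.round(cos_cross, 6))
print("color vs color:", np.round(cos_self, 6))
```

Because the synthetic subspaces are built on disjoint coordinates, the cross-subspace cosines come out near zero while the self-comparison cosines are all one. With real embeddings the cosines would be small but not exactly zero, so a threshold for "essentially orthogonal" would have to be chosen and justified separately.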

Cite

Text

Wolff et al. "The Independent Compositional Subspace Hypothesis for the Structure of CLIP's Last Layer." ICLR 2023 Workshops: ME-FoMo, 2023.

Markdown

[Wolff et al. "The Independent Compositional Subspace Hypothesis for the Structure of CLIP's Last Layer." ICLR 2023 Workshops: ME-FoMo, 2023.](https://mlanthology.org/iclrw/2023/wolff2023iclrw-independent/)

BibTeX

@inproceedings{wolff2023iclrw-independent,
  title     = {{The Independent Compositional Subspace Hypothesis for the Structure of CLIP's Last Layer}},
  author    = {Wolff, Max and Brendel, Wieland and Wolff, Stuart},
  booktitle = {ICLR 2023 Workshops: ME-FoMo},
  year      = {2023},
  url       = {https://mlanthology.org/iclrw/2023/wolff2023iclrw-independent/}
}