PHyCLIP: $\ell_1$-Product of Hyperbolic Factors Unifies Hierarchy and Compositionality in Vision-Language Representation Learning

Abstract

Vision-language models have achieved remarkable success in multi-modal representation learning from large-scale pairs of visual scenes and linguistic descriptions. However, they still struggle to simultaneously express two distinct types of semantic structure: the hierarchy within a concept family (e.g., *dog* $\preceq$ *mammal* $\preceq$ *animal*) and the compositionality across different concept families (e.g., "a dog in a car" $\preceq$ *dog*, *car*). Recent work has addressed this challenge by employing hyperbolic space, which efficiently captures tree-like hierarchies, yet its suitability for representing compositionality remains unclear. To resolve this dilemma, we propose *PHyCLIP*, which employs an $\ell_1$-*P*roduct metric on a Cartesian product of *Hy*perbolic factors. With our design, intra-family hierarchies emerge within individual hyperbolic factors, and cross-family composition is captured by the $\ell_1$-product metric, analogous to a Boolean algebra. Experiments on zero-shot classification, retrieval, hierarchical classification, and compositional understanding tasks demonstrate that PHyCLIP outperforms existing single-space approaches and offers more interpretable structures in the embedding space.
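
As a rough illustration of the $\ell_1$-product metric described above, the sketch below sums per-factor hyperbolic distances over a Cartesian product of factors. It is not the authors' implementation: the abstract does not say which hyperbolic model, curvature, or factor dimension PHyCLIP uses, so the unit Poincaré ball (curvature $-1$) and the function names `poincare_distance` and `l1_product_distance` are assumptions made purely for illustration.

```python
import math


def poincare_distance(x, y):
    """Geodesic distance between two points of the unit Poincare ball.

    Assumption: curvature -1 and the Poincare ball model; the abstract
    does not specify which hyperbolic model PHyCLIP uses.
    """
    sq_norm_x = sum(a * a for a in x)
    sq_norm_y = sum(b * b for b in y)
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_x) * (1.0 - sq_norm_y))
    # Clamp to the domain of acosh to absorb floating-point round-off.
    return math.acosh(max(arg, 1.0))


def l1_product_distance(xs, ys):
    """l1-product metric: the sum of the per-factor hyperbolic distances.

    xs and ys are sequences of points, one point per hyperbolic factor.
    """
    return sum(poincare_distance(x, y) for x, y in zip(xs, ys))


# Two hypothetical 2-D factors (e.g. one per concept family).
origin = [[0.0, 0.0], [0.0, 0.0]]
one_factor_moved = [[0.5, 0.0], [0.0, 0.0]]
both_factors_moved = [[0.5, 0.0], [0.5, 0.0]]

d_one = l1_product_distance(origin, one_factor_moved)
d_both = l1_product_distance(origin, both_factors_moved)
```

Under these assumptions, a point that departs from the origin in two factors is twice as far away as one that departs in a single factor by the same amount. The distances add across factors, which is the additive behavior the $\ell_1$ product provides for cross-family composition.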

Cite

Text

Yoshikawa and Matsubara. "PHyCLIP: $\ell_1$-Product of Hyperbolic Factors Unifies Hierarchy and Compositionality in Vision-Language Representation Learning." International Conference on Learning Representations, 2026.

Markdown

[Yoshikawa and Matsubara. "PHyCLIP: $\ell_1$-Product of Hyperbolic Factors Unifies Hierarchy and Compositionality in Vision-Language Representation Learning." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/yoshikawa2026iclr-phyclip/)

BibTeX

@inproceedings{yoshikawa2026iclr-phyclip,
  title     = {{PHyCLIP: $\ell_1$-Product of Hyperbolic Factors Unifies Hierarchy and Compositionality in Vision-Language Representation Learning}},
  author    = {Yoshikawa, Daiki and Matsubara, Takashi},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/yoshikawa2026iclr-phyclip/}
}