Image Clustering Conditioned on Text Criteria

Abstract

Classical clustering methods give users no direct control over the clustering results, and the results may not be consistent with the criterion a user has in mind. In this work, we present a new methodology for clustering images according to user-specified criteria given as text, leveraging modern Vision-Language Models and Large Language Models. We call our method Image Clustering Conditioned on Text Criteria (IC$|$TC), and it represents a different paradigm of image clustering. IC$|$TC requires a minimal and practical degree of human intervention and, in return, grants the user significant control over the clustering results. Our experiments show that IC$|$TC can effectively cluster images under various criteria, such as human action, physical location, or a person's mood, significantly outperforming baselines.

Cite

Text

Kwon et al. "Image Clustering Conditioned on Text Criteria." International Conference on Learning Representations, 2024.

Markdown

[Kwon et al. "Image Clustering Conditioned on Text Criteria." International Conference on Learning Representations, 2024.](https://mlanthology.org/iclr/2024/kwon2024iclr-image/)

BibTeX

@inproceedings{kwon2024iclr-image,
  title     = {{Image Clustering Conditioned on Text Criteria}},
  author    = {Kwon, Sehyun and Park, Jaeseung and Kim, Minkyu and Cho, Jaewoong and Ryu, Ernest K. and Lee, Kangwook},
  booktitle = {International Conference on Learning Representations},
  year      = {2024},
  url       = {https://mlanthology.org/iclr/2024/kwon2024iclr-image/}
}