An Image Is Worth Multiple Words: Discovering Object Level Concepts Using Multi-Concept Prompt Learning

Chen Jin, Ryutaro Tanno, Amrutha Saseendran, Tom Diethe, Philip Alexander Teare

ICML 2024 pp. 22210-22243


Abstract

Textual Inversion, a prompt learning method, learns a single text embedding for a new "word" to represent image style and appearance, allowing it to be integrated into natural language sentences to generate novel synthesised images. However, identifying multiple unknown object-level concepts within one scene remains a complex challenge. While recent methods have resorted to cropping or masking individual images to learn multiple concepts, these techniques often require prior knowledge of the new concepts and are labour-intensive. To address this challenge, we introduce Multi-Concept Prompt Learning (MCPL), where multiple unknown "words" are simultaneously learned from a single sentence-image pair, without any image annotations. To enhance the accuracy of word-concept correlation and refine attention mask boundaries, we propose three regularisation techniques: Attention Masking, Prompts Contrastive Loss, and Bind Adjective. Extensive quantitative comparisons with both real-world categories and biomedical images demonstrate that our method can learn new semantically disentangled concepts. Our approach emphasises learning solely from textual embeddings, using less than 10% of the storage space required by other methods. The project page, code, and data are available at https://astrazeneca.github.io/mcpl.github.io.


Cite

Text

Jin et al. "An Image Is Worth Multiple Words: Discovering Object Level Concepts Using Multi-Concept Prompt Learning." International Conference on Machine Learning, 2024.

Markdown

[Jin et al. "An Image Is Worth Multiple Words: Discovering Object Level Concepts Using Multi-Concept Prompt Learning." International Conference on Machine Learning, 2024.](https://mlanthology.org/icml/2024/jin2024icml-image/)

BibTeX

@inproceedings{jin2024icml-image,
  title     = {{An Image Is Worth Multiple Words: Discovering Object Level Concepts Using Multi-Concept Prompt Learning}},
  author    = {Jin, Chen and Tanno, Ryutaro and Saseendran, Amrutha and Diethe, Tom and Teare, Philip Alexander},
  booktitle = {International Conference on Machine Learning},
  year      = {2024},
  pages     = {22210--22243},
  volume    = {235},
  url       = {https://mlanthology.org/icml/2024/jin2024icml-image/}
}