ABC: Achieving Better Control of Visual Embeddings Using VLLMs

Benjamin Schneider, Florian Kerschbaum, Wenhu Chen

TMLR 2025

Abstract

Visual embedding models excel at zero-shot tasks like visual retrieval and classification. However, these models cannot be used for tasks that contain ambiguity or require user instruction. These tasks necessitate an embedding model whose outputs can use a natural language instruction to control the representation of a visual embedding. Existing CLIP-based approaches embed images and text independently, and fuse the result. We find that this results in weak interactions between modalities, and poor user control over the representation. We introduce ABC, an open-source multimodal embedding model that uses a vision-language model backbone to deeply integrate image features with natural language instructions. ABC achieves best-for-size performance on MSCOCO image-to-text retrieval and is the top-performing model on classification and VQA tasks in the Massive Multimodal Embedding Benchmark. With a strongly unified vision-language representation, ABC can use natural language to solve subtle and potentially ambiguous visual retrieval problems. To evaluate this capability, we design CtrlBench, a benchmark that requires interleaving textual instructions with image content for correct retrieval. ABC advances the state of visual embeddings, outputting high-quality visual representations with natural language control. Our model and datasets are available at our project page: https://tiger-ai-lab.github.io/ABC/
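
The late-fusion baseline mentioned in the abstract (embed image and text independently, then fuse the result) can be illustrated with a toy sketch. This is not the ABC model or any CLIP implementation: the three-dimensional vectors, the sum-and-renormalize fusion rule, and the helper names `late_fusion` and `retrieve` are all assumptions chosen only to show the mechanics of fusing two separately computed embeddings and retrieving by cosine similarity.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def late_fusion(image_emb, text_emb):
    """Fuse two independently computed embeddings by summing and renormalizing.

    Assumption: summation is just one simple fusion rule used for illustration.
    """
    summed = [i + t for i, t in zip(normalize(image_emb), normalize(text_emb))]
    return normalize(summed)


def retrieve(query, candidates):
    """Return the key of the candidate with the highest cosine similarity."""
    return max(candidates, key=lambda name: cosine(query, candidates[name]))


# Hypothetical toy embeddings; the axes loosely stand for (dog, cat, other).
image_emb = [0.7, 0.7, 0.0]        # image showing both a dog and a cat
instruction_emb = [1.0, 0.0, 0.0]  # instruction such as "focus on the dog"
candidates = {
    "a dog on a couch": [1.0, 0.0, 0.2],
    "a cat on a couch": [0.0, 1.0, 0.2],
}

fused = late_fusion(image_emb, instruction_emb)
best = retrieve(fused, candidates)
print(best)
```

In this sketch the image vector is computed without ever seeing the instruction, so the instruction can only shift the final vector after the fact. The abstract's argument is that a vision-language model backbone instead lets the instruction influence how the image features themselves are represented.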

Cite

Text

Schneider et al. "ABC: Achieving Better Control of Visual Embeddings Using VLLMs." Transactions on Machine Learning Research, 2025.

Markdown

[Schneider et al. "ABC: Achieving Better Control of Visual Embeddings Using VLLMs." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/schneider2025tmlr-abc/)

BibTeX

@article{schneider2025tmlr-abc,
  title     = {{ABC: Achieving Better Control of Visual Embeddings Using VLLMs}},
  author    = {Schneider, Benjamin and Kerschbaum, Florian and Chen, Wenhu},
  journal   = {Transactions on Machine Learning Research},
  year      = {2025},
  url       = {https://mlanthology.org/tmlr/2025/schneider2025tmlr-abc/}
}