Visual Abstraction: A Plug-and-Play Approach for Text-Visual Retrieval

Abstract

Text-to-visual retrieval often struggles with semantic redundancy and granularity mismatches between textual queries and visual content. Unlike existing methods that address these challenges during training, we propose VISual Abstraction (VISA), a test-time approach that enhances retrieval by transforming visual content into textual descriptions using off-the-shelf large models. The generated text descriptions, with their dense semantics, naturally filter out low-level redundant visual information. To further address granularity issues, VISA incorporates a question-answering process, enhancing the text description with the specific granularity information requested by the user. Extensive experiments demonstrate that VISA brings substantial improvements in text-to-image and text-to-video retrieval for both short- and long-context queries, offering a plug-and-play enhancement to existing retrieval systems.
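To make the pipeline concrete, below is a minimal sketch of a VISA-style test-time retrieval loop, not the authors' implementation. All function names here (`caption_fn`, `qa_fn`, `embed_fn`, `visa_retrieve`) are hypothetical placeholders: `caption_fn` stands in for an off-the-shelf vision-language captioner, `qa_fn` for a visual question-answering model used to inject query-specific granularity, and `embed_fn` for any text encoder.

```python
# Hypothetical sketch of a VISA-style pipeline (not the paper's code).
# Assumed interfaces:
#   caption_fn(image) -> str          : off-the-shelf VLM caption for an image
#   qa_fn(image, question) -> str     : VQA answer about the image
#   embed_fn(list_of_texts) -> (n, d) : text embeddings as a NumPy array
from typing import Callable, List, Optional, Sequence
import numpy as np

def visa_retrieve(
    query: str,
    images: Sequence,
    caption_fn: Callable[[object], str],
    qa_fn: Callable[[object, str], str],
    embed_fn: Callable[[List[str]], np.ndarray],
    question: Optional[str] = None,
) -> List[int]:
    """Rank images for a text query by matching entirely in text space."""
    # 1) Abstract each image into a dense textual description; the caption
    #    naturally discards low-level, semantically redundant visual detail.
    docs = [caption_fn(img) for img in images]

    # 2) Optionally enrich each description with the granularity the query
    #    asks about, via question answering (e.g., "What brand is the car?").
    if question is not None:
        docs = [f"{d} {qa_fn(img, question)}" for d, img in zip(docs, images)]

    # 3) Text-to-text retrieval: cosine similarity between the query and
    #    the generated descriptions.
    q = embed_fn([query])
    D = embed_fn(docs)
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    D = D / np.linalg.norm(D, axis=1, keepdims=True)
    scores = (D @ q.T).ravel()
    return np.argsort(-scores).tolist()  # image indices, best match first
```

Because every component is swapped in behind a plain callable, any existing captioner, VQA model, or text encoder can serve, which is what makes this kind of test-time approach plug-and-play with respect to the underlying retrieval system.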

Cite

Text

Ding et al. "Visual Abstraction: A Plug-and-Play Approach for Text-Visual Retrieval." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Ding et al. "Visual Abstraction: A Plug-and-Play Approach for Text-Visual Retrieval." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/ding2025icml-visual/)

BibTeX

@inproceedings{ding2025icml-visual,
  title     = {{Visual Abstraction: A Plug-and-Play Approach for Text-Visual Retrieval}},
  author    = {Ding, Guofeng and Lu, Yiding and Hu, Peng and Yang, Mouxing and Lin, Yijie and Peng, Xi},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {13825--13844},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/ding2025icml-visual/}
}