Visual Abstraction: A Plug-and-Play Approach for Text-Visual Retrieval
Abstract
Text-to-visual retrieval often struggles with semantic redundancy and granularity mismatches between textual queries and visual content. Unlike existing methods that address these challenges during training, we propose VISual Abstraction (VISA), a test-time approach that enhances retrieval by transforming visual content into textual descriptions using off-the-shelf large models. The generated text descriptions, with their dense semantics, naturally filter out low-level redundant visual information. To further address granularity mismatches, VISA incorporates a question-answering process that enriches the description with information at the specific granularity the user's query requests. Extensive experiments demonstrate that VISA brings substantial improvements in text-to-image and text-to-video retrieval for both short- and long-context queries, offering a plug-and-play enhancement to existing retrieval systems.
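To make the pipeline concrete, below is a minimal sketch of a VISA-style test-time retrieval loop. The caption_image and answer_question functions are hypothetical stand-ins for the off-the-shelf large models (the abstract names no specific models or interfaces), and the sentence-transformers encoder is an arbitrary choice of text retriever for the final text-text matching.

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def caption_image(image_path: str) -> str:
    # Hypothetical stand-in for an off-the-shelf captioning model that
    # abstracts the image into dense text, dropping low-level redundancy.
    return "A dog leaping to catch a frisbee in a sunny park."

def answer_question(image_path: str, question: str) -> str:
    # Hypothetical stand-in for a VQA model that supplies detail at the
    # granularity the query asks about.
    return "The frisbee is red."

def retrieve(query: str, image_paths: list[str]) -> list[tuple[str, float]]:
    # 1) Abstract each image into a caption.
    # 2) Enrich the caption with a query-conditioned answer (granularity).
    # 3) Rank by text-text similarity instead of text-image similarity.
    descriptions = [
        caption_image(p) + " " + answer_question(p, query)
        for p in image_paths
    ]
    q_emb = encoder.encode(query, convert_to_tensor=True)
    d_emb = encoder.encode(descriptions, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, d_emb)[0]
    return sorted(zip(image_paths, scores.tolist()),
                  key=lambda pair: pair[1], reverse=True)

ranked = retrieve("a dog catching a red frisbee", ["img1.jpg", "img2.jpg"])

In an actual deployment the stubs would be replaced by real captioning and VQA models; the design point the sketch illustrates is that ranking happens entirely in text space, so any existing text retriever can be plugged in unchanged and no retraining is required.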
Cite
Text
Ding et al. "Visual Abstraction: A Plug-and-Play Approach for Text-Visual Retrieval." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown
[Ding et al. "Visual Abstraction: A Plug-and-Play Approach for Text-Visual Retrieval." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/ding2025icml-visual/)

BibTeX
@inproceedings{ding2025icml-visual,
title = {{Visual Abstraction: A Plug-and-Play Approach for Text-Visual Retrieval}},
author = {Ding, Guofeng and Lu, Yiding and Hu, Peng and Yang, Mouxing and Lin, Yijie and Peng, Xi},
booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
year = {2025},
pages = {13825--13844},
volume = {267},
url = {https://mlanthology.org/icml/2025/ding2025icml-visual/}
}