VIRTUE: Visual-Interactive Text-Image Universal Embedder
Abstract
Multimodal representation learning models have proven effective across complex tasks, and the integration of vision-language models (VLMs) has further enabled embedding models with instruction-following capabilities. However, existing embedding models lack visual-interactive capabilities that let users specify regions of interest (e.g., point, bounding box, mask), which have been explored in generative models to broaden their human-interactive applicability. Equipping embedding models with visual interactions would not only unlock new applications with localized grounding of user intent, which remains unexplored, but also enable the models to learn entity-level information within images to complement their global representations for conventional embedding tasks. In this paper, we propose a novel **V**isual-**I**nte**R**active **T**ext-Image **U**niversal **E**mbedder (**VIRTUE**) that extends the capabilities of the segmentation model and the vision-language model to the realm of representation learning. In VIRTUE, the segmentation model can process visual prompts that pinpoint specific regions within an image, thereby enabling the embedder to handle complex and ambiguous scenarios more precisely. To evaluate the visual-interaction ability of VIRTUE, we introduce a large-scale **S**egmentation-and-Scene **Ca**ption **R**etrieval (**SCaR**) benchmark comprising 1M samples, which aims to retrieve the text caption by jointly considering the entity of a specific object and the image scene. VIRTUE consistently achieves state-of-the-art performance with significant improvements across 36 universal MMEB (**3.1%–8.5%**) and five visual-interactive SCaR (**15.2%–20.3%**) tasks. The code, models, and benchmarks are available at https://github.com/sony/virtue.
Cite
Text
Wang et al. "VIRTUE: Visual-Interactive Text-Image Universal Embedder." International Conference on Learning Representations, 2026.
Markdown
[Wang et al. "VIRTUE: Visual-Interactive Text-Image Universal Embedder." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/wang2026iclr-virtue/)
BibTeX
@inproceedings{wang2026iclr-virtue,
title = {{VIRTUE: Visual-Interactive Text-Image Universal Embedder}},
author = {Wang, Wei-Yao and Tateishi, Kazuya and Wu, Qiyu and Takahashi, Shusuke and Mitsufuji, Yuki},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/wang2026iclr-virtue/}
}