Multi-Modal Multi-Task Unified Embedding Model (M3T-UEM): A Task-Adaptive Representation Learning Framework

Abstract

We present the Multi-Modal Multi-Task Unified Embedding Model (M3T-UEM), a framework that advances vision-language matching and retrieval by leveraging a large language model (LLM) backbone. While concurrent LLM-based approaches such as VLM2VEC, MM-Embed, NV-Embed, and MM-GEM have demonstrated impressive capabilities in multi-modal and multi-task scenarios, our work introduces novel mechanisms for task-adaptive learning and embedding extraction that further enhance the potential of LLM-based retrieval systems. Our key technical contribution is a task-aware contrastive learning framework with an automated Bayesian weighting mechanism, which provides a principled way to balance multiple tasks during training and departs from conventional contrastive learning strategies. We further enhance the framework with a multiple-token summarization strategy and an auxiliary language modeling objective, which together significantly improve retrieval performance. Comprehensive experiments on the M-BEIR and ICinW benchmarks demonstrate M3T-UEM's effectiveness, showing competitive or superior performance compared to both traditional encoder-based methods and recent LLM-based approaches. Owing to the incorporation of an LLM backbone, the method is particularly strong at handling compositional conceptual changes and multilingual scenarios, where it drastically outperforms CLIP in zero-shot settings, often by orders of magnitude.
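
For intuition, the sketch below shows one common way such Bayesian task weighting over multiple contrastive objectives can be realized: per-task InfoNCE losses balanced by learned log-variance parameters, in the spirit of homoscedastic-uncertainty weighting. The class name, tensor shapes, and the exact weighting rule are illustrative assumptions for this sketch, not the paper's actual implementation.

# Minimal sketch (assumed, not the authors' code): per-task contrastive
# losses combined with learned uncertainty-based task weights.
import torch
import torch.nn.functional as F

class TaskWeightedContrastiveLoss(torch.nn.Module):
    def __init__(self, num_tasks: int, temperature: float = 0.07):
        super().__init__()
        # One learnable log-variance per task; the task weight is its inverse exponential.
        self.log_vars = torch.nn.Parameter(torch.zeros(num_tasks))
        self.temperature = temperature

    def info_nce(self, q, d):
        # q, d: (batch, dim) L2-normalized query/document embeddings;
        # matching pairs sit on the diagonal of the similarity matrix.
        logits = q @ d.t() / self.temperature
        targets = torch.arange(q.size(0), device=q.device)
        return F.cross_entropy(logits, targets)

    def forward(self, per_task_batches):
        # per_task_batches: list of (query_emb, doc_emb) tuples, one per task.
        total = 0.0
        for t, (q, d) in enumerate(per_task_batches):
            loss_t = self.info_nce(F.normalize(q, dim=-1), F.normalize(d, dim=-1))
            precision = torch.exp(-self.log_vars[t])  # acts as 1 / sigma_t^2
            # Precision-weighted loss plus a regularizer that keeps weights from collapsing to zero.
            total = total + precision * loss_t + self.log_vars[t]
        return total / len(per_task_batches)

In this formulation, tasks whose losses are noisier or harder are automatically down-weighted during training, while the log-variance regularizer prevents the trivial solution of ignoring all tasks.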

Cite

Text

Sharma et al. "Multi-Modal Multi-Task Unified Embedding Model (M3T-UEM): A Task-Adaptive Representation Learning Framework." International Conference on Computer Vision, 2025.

Markdown

[Sharma et al. "Multi-Modal Multi-Task Unified Embedding Model (M3T-UEM): A Task-Adaptive Representation Learning Framework." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/sharma2025iccv-multimodal/)

BibTeX

@inproceedings{sharma2025iccv-multimodal,
  title     = {{Multi-Modal Multi-Task Unified Embedding Model (M3T-UEM): A Task-Adaptive Representation Learning Framework}},
  author    = {Sharma, Rohan and Chen, Changyou and Chang, Feng-Ju and Yun, Seongjun and Xie, Xiaohu and Meng, Rui and Xu, Dehong and Mottini, Alejandro and Cui, Qingjun},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {22783--22793},
  url       = {https://mlanthology.org/iccv/2025/sharma2025iccv-multimodal/}
}