A Balancing Act: Optimizing Classification and Retrieval in Cross-Modal Vision Models
Abstract
Despite the promising capabilities of vision-language models (VLMs) in diverse tasks, recent studies reveal that they struggle with the fundamental task of image classification. In this study, we explore leveraging state-of-the-art task-specific classification models as a foundation for VLMs, aiming to preserve strong classification performance. Specifically, we assess the impact of contrastive tuning, used to enable cross-modal retrieval, on a Vision Transformer (ViT) model trained for multi-label classification on natural images and on a Hierarchical Vision Transformer (H-ViT) trained for prostate cancer grading in Whole-Slide Images (WSIs). Our results demonstrate that contrastive fine-tuning creates a clear trade-off: classification accuracy rapidly deteriorates toward zero as vision-text alignment improves. By balancing task-specific and contrastive objectives in the loss function during fine-tuning, we achieve competitive slide-level retrieval performance while maintaining classification accuracy. Our code is available at https://github.com/DIAGNijmegen/tradeoff_classification_alignment.git.
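The balancing of task-specific and contrastive objectives mentioned in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the convex weighting `lam`, the use of plain cross-entropy as the task loss, the symmetric InfoNCE form of the contrastive term, the temperature value, and all function names are assumptions made for illustration only.

```python
import math

def log_softmax(row):
    # Numerically stable log-softmax over one list of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def cross_entropy(logits, targets):
    # Mean cross-entropy over a batch; targets are class indices.
    return -sum(log_softmax(r)[t] for r, t in zip(logits, targets)) / len(logits)

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Symmetric InfoNCE (assumed form): the image and text embeddings
    # at the same batch index are treated as the matching pair.
    def normalize(v):
        n = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / n for x in v]
    img = [normalize(v) for v in img_emb]
    txt = [normalize(v) for v in txt_emb]
    # Cosine similarities scaled by the temperature (image rows, text columns).
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt] for i in img]
    sim_t = [list(col) for col in zip(*sim)]
    idx = list(range(len(img)))
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(sim, idx) + cross_entropy(sim_t, idx))

def balanced_loss(cls_logits, cls_targets, img_emb, txt_emb, lam=0.5):
    # Hypothetical weighting: lam = 1 keeps only the task loss,
    # lam = 0 keeps only the contrastive (alignment) loss.
    task = cross_entropy(cls_logits, cls_targets)
    align = contrastive_loss(img_emb, txt_emb)
    return lam * task + (1.0 - lam) * align
```

Under these assumptions, sweeping `lam` between 0 and 1 would trace out the trade-off the abstract describes, with pure contrastive tuning at one end and pure task-specific training at the other.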
Cite
Text
Lefkes et al. "A Balancing Act: Optimizing Classification and Retrieval in Cross-Modal Vision Models." Medical Imaging with Deep Learning, 2025.
Markdown
[Lefkes et al. "A Balancing Act: Optimizing Classification and Retrieval in Cross-Modal Vision Models." Medical Imaging with Deep Learning, 2025.](https://mlanthology.org/midl/2025/lefkes2025midl-balancing/)
BibTeX
@inproceedings{lefkes2025midl-balancing,
title = {{A Balancing Act: Optimizing Classification and Retrieval in Cross-Modal Vision Models}},
author = {Lefkes, Judith and Grisi, Clément and Litjens, Geert},
booktitle = {Medical Imaging with Deep Learning},
year = {2025},
url = {https://mlanthology.org/midl/2025/lefkes2025midl-balancing/}
}