A Balancing Act: Optimizing Classification and Retrieval in Cross-Modal Vision Models

Abstract

Despite the promising capabilities of vision-language models (VLMs) in diverse tasks, recent studies reveal that they struggle with the fundamental task of image classification. In this study, we explore leveraging state-of-the-art task-specific classification models as a foundation for VLMs, aiming to preserve strong classification performance. Specifically, we assess the impact of contrastive tuning to enable cross-modal retrieval capabilities on a Vision Transformer (ViT) model trained for multi-label classification on natural images and a Hierarchical Vision Transformer (H-ViT) trained for prostate cancer grading in Whole-Slide Images (WSIs). Our results demonstrate that contrastive fine-tuning creates a clear trade-off: classification accuracy rapidly deteriorates toward zero as vision-text alignment improves. By balancing task-specific and contrastive objectives in the loss function during fine-tuning, we achieve competitive slide-level retrieval performance while maintaining classification accuracy. Our code is available at https://github.com/DIAGNijmegen/tradeoff_classification_alignment.git.
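
To make the idea of balancing the two objectives concrete, here is a minimal sketch of one way such a combined loss could be written. It is not taken from the paper or its repository. The convex weighting with `alpha`, the symmetric InfoNCE form of the contrastive term, the temperature of 0.07, and the single-label cross-entropy used as the task loss are all assumptions made for illustration, as are the function names.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one example, using a stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch where pair k (image k, text k) matches.

    The temperature value is an assumed default, not a value from the paper.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine similarities scaled by the temperature: rows are images, columns are texts.
    sim = [[dot(i, t) / temperature for t in txts] for i in imgs]
    image_to_text = sum(cross_entropy(sim[k], k) for k in range(n)) / n
    text_to_image = sum(
        cross_entropy([sim[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (image_to_text + text_to_image)


def balanced_loss(class_logits, labels, image_embs, text_embs, alpha=0.5):
    """Hypothetical combined objective: alpha * task + (1 - alpha) * contrastive.

    alpha = 1 keeps only the classification term; alpha = 0 keeps only the
    vision-text alignment term.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    task = sum(
        cross_entropy(l, y) for l, y in zip(class_logits, labels)
    ) / len(labels)
    align = contrastive_loss(image_embs, text_embs)
    return alpha * task + (1.0 - alpha) * align
```

In this sketch, sweeping `alpha` between 0 and 1 moves the fine-tuning objective between pure contrastive alignment and pure classification. A multi-label setting like the ViT experiment would more plausibly use a per-class binary cross-entropy as the task term, so the task loss here should be read as a stand-in.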

Cite

Text

Lefkes et al. "A Balancing Act: Optimizing Classification and Retrieval in Cross-Modal Vision Models." Medical Imaging with Deep Learning, 2025.

Markdown

[Lefkes et al. "A Balancing Act: Optimizing Classification and Retrieval in Cross-Modal Vision Models." Medical Imaging with Deep Learning, 2025.](https://mlanthology.org/midl/2025/lefkes2025midl-balancing/)

BibTeX

@inproceedings{lefkes2025midl-balancing,
  title     = {{A Balancing Act: Optimizing Classification and Retrieval in Cross-Modal Vision Models}},
  author    = {Lefkes, Judith and Grisi, Clément and Litjens, Geert},
  booktitle = {Medical Imaging with Deep Learning},
  year      = {2025},
  url       = {https://mlanthology.org/midl/2025/lefkes2025midl-balancing/}
}