MVCM: Enhancing Multi-View and Cross-Modality Alignment for Medical Visual Question Answering and Medical Image-Text Retrieval

Zou, Yuanhao; Yin, Zhaozheng

CVPRW 2025 pp. 180-190

Abstract

Recent work on medical vision-language tasks, such as Medical Visual Question Answering (Med-VQA) and Medical Image-Text Retrieval (Med-ITR), aims to learn jointly from images and text. However, two main issues persist: multi-view medical images are neglected, and cross-modality understanding is incomplete. Current studies often treat each image-text pair as an independent instance (i.e., at the instance level), ignoring the contextual information available from multi-view images of the same study. Although some methods have explored refined alignment, combining alignment of global representations with token-wise alignment of local representations, they often use only a uni-modality encoder (e.g., a visual encoder) for downstream applications and therefore lack comprehensive cross-modality understanding. To address these issues, this paper introduces MVCM, a framework that supports Multi-View and Cross-Modality alignment for Med-VQA and Med-ITR tasks. Our proposed method fully utilizes the multi-view images in radiology datasets and aligns them at the study level. We also employ various pretext tasks to support cross-modality alignment. We fine-tune the proposed model on the downstream Med-VQA and Med-ITR tasks, where it outperforms state-of-the-art methods across multiple datasets. The code will be publicly available.
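
To make the idea of study-level alignment concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the embeddings of all views belonging to one study are mean-pooled into a single study embedding, which is then aligned with the study's text embedding using a symmetric InfoNCE-style contrastive loss. The pooling choice, the loss, the temperature value, and every function name here are illustrative assumptions; the abstract does not specify them.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def pool_views_by_study(view_emb, study_ids):
    """Mean-pool per-view image embeddings into one embedding per study.

    Assumption: mean pooling stands in for whatever multi-view
    aggregation the paper actually uses.
    """
    unique_ids, inverse = np.unique(study_ids, return_inverse=True)
    pooled = np.zeros((len(unique_ids), view_emb.shape[1]))
    np.add.at(pooled, inverse, view_emb)  # sum the views of each study
    counts = np.bincount(inverse, minlength=len(unique_ids))[:, None]
    return unique_ids, pooled / counts

def log_softmax(logits, axis):
    """Numerically stable log-softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def study_level_contrastive_loss(view_emb, study_ids, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss between study embeddings and text embeddings.

    Row i of text_emb must belong to the i-th study id in sorted order.
    The temperature of 0.07 is an assumed, commonly used default.
    """
    _, study_emb = pool_views_by_study(view_emb, study_ids)
    img = l2_normalize(study_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature      # (num_studies, num_studies)
    diag = np.arange(logits.shape[0])       # matching pairs sit on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_i2t + loss_t2i)

# Toy usage: three views, the first two from study 0 and the last from study 1.
views = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
studies = np.array([0, 0, 1])
reports = np.array([[1.0, 0.0], [0.0, 1.0]])
loss = study_level_contrastive_loss(views, studies, reports)
```

In this sketch, an instance-level variant would skip `pool_views_by_study` and contrast every view with its report separately, which is the setting the abstract describes as neglecting multi-view context.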

Cite

Text

Zou and Yin. "MVCM: Enhancing Multi-View and Cross-Modality Alignment for Medical Visual Question Answering and Medical Image-Text Retrieval." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.

Markdown

[Zou and Yin. "MVCM: Enhancing Multi-View and Cross-Modality Alignment for Medical Visual Question Answering and Medical Image-Text Retrieval." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.](https://mlanthology.org/cvprw/2025/zou2025cvprw-mvcm/)

BibTeX

@inproceedings{zou2025cvprw-mvcm,
  title     = {{MVCM: Enhancing Multi-View and Cross-Modality Alignment for Medical Visual Question Answering and Medical Image-Text Retrieval}},
  author    = {Zou, Yuanhao and Yin, Zhaozheng},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2025},
  pages     = {180--190},
  url       = {https://mlanthology.org/cvprw/2025/zou2025cvprw-mvcm/}
}