MVCM: Enhancing Multi-View and Cross-Modality Alignment for Medical Visual Question Answering and Medical Image-Text Retrieval
Abstract
Recent advances in medical vision-language tasks, such as Medical Visual Question Answering (Med-VQA) and Medical Image-Text Retrieval (Med-ITR), aim to learn jointly from images and text. However, two main issues persist: multi-view medical images are neglected, and cross-modality understanding is incomplete. Current studies often treat each image-text pair as an independent instance (i.e., at the instance level), ignoring the contextual information available from multi-view images of the same study. Although some methods have explored refined alignment, combining alignment of global representations with token-wise alignment of local representations, they often use only a uni-modality encoder (e.g., a visual encoder) for downstream applications and therefore lack comprehensive cross-modality understanding. To address these issues, this paper introduces MVCM, a framework that supports Multi-View and Cross-Modality alignment for Med-VQA and Med-ITR. The proposed method fully utilizes the multi-view images in radiology datasets and aligns them at the study level. We also employ various pretext tasks to support cross-modality alignment. We fine-tune the proposed model on the downstream Med-VQA and Med-ITR tasks, where it outperforms state-of-the-art methods across multiple datasets. The code will be publicly available.
Cite
Text
Zou and Yin. "MVCM: Enhancing Multi-View and Cross-Modality Alignment for Medical Visual Question Answering and Medical Image-Text Retrieval." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.
Markdown
[Zou and Yin. "MVCM: Enhancing Multi-View and Cross-Modality Alignment for Medical Visual Question Answering and Medical Image-Text Retrieval." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.](https://mlanthology.org/cvprw/2025/zou2025cvprw-mvcm/)
BibTeX
@inproceedings{zou2025cvprw-mvcm,
title = {{MVCM: Enhancing Multi-View and Cross-Modality Alignment for Medical Visual Question Answering and Medical Image-Text Retrieval}},
author = {Zou, Yuanhao and Yin, Zhaozheng},
booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
year = {2025},
pages = {180--190},
url = {https://mlanthology.org/cvprw/2025/zou2025cvprw-mvcm/}
}