Unveiling Differences: A Vision Encoder-Decoder Model for Difference Medical Visual Question Answering

Marhuenda, Luis-Jesus; Obrador-Reina, Miquel; Aas-Alas, Mohamed; Albiol, Alberto; Paredes, Roberto

MIDL 2025

Abstract

Difference Medical Visual Question Answering (Diff-VQA), a specialized subfield of Medical VQA, tackles the critical task of identifying and describing differences between pairs of medical images. This study introduces a novel Vision Encoder-Decoder (VED) architecture tailored for this task, focusing on the comparison of chest X-ray images to detect and explain changes. The proposed model incorporates two key innovations: (1) a lightweight Transformer text decoder architecture capable of generating precise and contextually relevant answers to complex medical questions, and (2) an enhanced fusion mechanism that improves the model’s ability to distinguish between two input images, enabling more accurate comparison of radiological findings. Our approach excels in identifying significant changes, such as pneumonia and lung opacity, demonstrating its utility in automating preliminary radiological assessments. By leveraging large-scale, domain-specific datasets and employing advanced training strategies, our VED architecture achieves state-of-the-art performance on standard VQA metrics, setting a new benchmark in diagnostic accuracy. These advancements highlight the potential of Diff-VQA to enhance clinical workflows and support radiologists in making more precise, informed decisions.

Cite

Text

Marhuenda et al. "Unveiling Differences: A Vision Encoder-Decoder Model for Difference Medical Visual Question Answering." Medical Imaging with Deep Learning, 2025.

Markdown

[Marhuenda et al. "Unveiling Differences: A Vision Encoder-Decoder Model for Difference Medical Visual Question Answering." Medical Imaging with Deep Learning, 2025.](https://mlanthology.org/midl/2025/marhuenda2025midl-unveiling/)

BibTeX

@inproceedings{marhuenda2025midl-unveiling,
  title     = {{Unveiling Differences: A Vision Encoder-Decoder Model for Difference Medical Visual Question Answering}},
  author    = {Marhuenda, Luis-Jesus and Obrador-Reina, Miquel and Aas-Alas, Mohamed and Albiol, Alberto and Paredes, Roberto},
  booktitle = {Medical Imaging with Deep Learning},
  year      = {2025},
  url       = {https://mlanthology.org/midl/2025/marhuenda2025midl-unveiling/}
}