Multi-Modal Pre-Training for Medical Vision-Language Understanding and Generation: An Empirical Study with a New Benchmark

Abstract

With the availability of large-scale, comprehensive, and general-purpose vision-language (VL) datasets such as MSCOCO, vision-language pre-training (VLP) has become an active area of research and has proven effective for various VL tasks such as visual question answering. However, studies on VLP in the medical domain have so far remained scarce. To provide a comprehensive perspective on VLP for medical VL tasks, we conduct a thorough experimental analysis of the key factors that may affect the performance of VLP with a unified vision-language Transformer. To enable sound and quick pre-training decisions, we propose RadioGraphy Captions (RGC), a high-quality, multi-modality radiographic dataset containing 18,434 image-caption pairs collected from the open-access online database MedPix. RGC can be used as a pre-training dataset or as a new benchmark for medical report generation and medical image-text retrieval. By utilizing RGC and other available datasets for pre-training, we develop several key insights that can guide future medical VLP research, along with new strong baselines for various medical VL tasks.

Cite

Text

Xu et al. "Multi-Modal Pre-Training for Medical Vision-Language Understanding and Generation: An Empirical Study with a New Benchmark." Proceedings of the Conference on Health, Inference, and Learning, 2023.

Markdown

[Xu et al. "Multi-Modal Pre-Training for Medical Vision-Language Understanding and Generation: An Empirical Study with a New Benchmark." Proceedings of the Conference on Health, Inference, and Learning, 2023.](https://mlanthology.org/chil/2023/xu2023chil-multimodal/)

BibTeX

@inproceedings{xu2023chil-multimodal,
  title     = {{Multi-Modal Pre-Training for Medical Vision-Language Understanding and Generation: An Empirical Study with a New Benchmark}},
  author    = {Xu, Li and Liu, Bo and Khan, Ameer Hamza and Fan, Lu and Wu, Xiao-Ming},
  booktitle = {Proceedings of the Conference on Health, Inference, and Learning},
  year      = {2023},
  pages     = {117--132},
  volume    = {209},
  url       = {https://mlanthology.org/chil/2023/xu2023chil-multimodal/}
}