DiversityMedQA: A Benchmark for Assessing Demographic Biases in Medical Diagnosis Using Large Language Models
Abstract
As large language models (LLMs) gain traction in healthcare, concerns about their susceptibility to demographic biases are growing. We introduce DiversityMedQA, a novel benchmark designed to assess LLM responses to medical queries across diverse patient demographics, such as gender and ethnicity. By perturbing questions from the MedQA dataset, which comprises medical board exam questions, we created a benchmark that captures the nuanced differences in medical diagnosis across varying patient profiles. Our findings reveal notable discrepancies in model performance when tested against these demographic variations. To ensure the accuracy of these perturbations, we also propose a filtering strategy that validates each one. By releasing DiversityMedQA, we provide a resource for evaluating and mitigating demographic bias in LLM medical diagnoses.
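The perturb-then-filter pipeline the abstract describes can be illustrated with a short Python sketch. This is a minimal illustration under stated assumptions, not the authors' released code: `query_llm`, the prompt wording, and the demographic attribute values shown here are all hypothetical placeholders.

```python
def query_llm(prompt: str) -> str:
    """Placeholder: route `prompt` to any chat-completion LLM and return its reply."""
    raise NotImplementedError  # supply your own model call here


def perturb_question(question: str, attribute: str, new_value: str) -> str:
    """Rewrite a MedQA question so that one demographic attribute changes.

    Illustrative prompt; the paper's actual prompts may differ.
    """
    prompt = (
        f"Rewrite the following medical question so that the patient's "
        f"{attribute} is '{new_value}'. Change nothing else.\n\n{question}"
    )
    return query_llm(prompt)


def is_valid_perturbation(original: str, perturbed: str, attribute: str) -> bool:
    """Filtering step: keep a pair only if the judge model agrees the edit
    changed the stated demographic and nothing clinically relevant."""
    prompt = (
        f"Do these two questions differ only in the patient's {attribute}? "
        f"Answer YES or NO.\n\nA: {original}\n\nB: {perturbed}"
    )
    return query_llm(prompt).strip().upper().startswith("YES")


def build_pairs(questions: list[str]) -> list[tuple[str, str]]:
    """Generate (original, perturbed) pairs that pass the validity filter."""
    pairs = []
    for q in questions:
        p = perturb_question(q, "gender", "female")
        if is_valid_perturbation(q, p, "gender"):
            pairs.append((q, p))
    return pairs
```

Comparing model accuracy on the original versus perturbed halves of each retained pair then exposes any demographic-dependent performance gap.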
Cite

Text

Rawat et al. "DiversityMedQA: A Benchmark for Assessing Demographic Biases in Medical Diagnosis Using Large Language Models." NeurIPS 2024 Workshops: AIM-FM, 2024.

Markdown

[Rawat et al. "DiversityMedQA: A Benchmark for Assessing Demographic Biases in Medical Diagnosis Using Large Language Models." NeurIPS 2024 Workshops: AIM-FM, 2024.](https://mlanthology.org/neuripsw/2024/rawat2024neuripsw-diversitymedqa/)

BibTeX
@inproceedings{rawat2024neuripsw-diversitymedqa,
  title = {{DiversityMedQA: A Benchmark for Assessing Demographic Biases in Medical Diagnosis Using Large Language Models}},
  author = {Rawat, Rajat and McBride, Hudson and Ghosh, Rajarshi and Nirmal, Dhiyaan Chakkresh and Moon, Jong and Alamari, Dhruv and Zhu, Kevin and O'Brien, Sean},
  booktitle = {NeurIPS 2024 Workshops: AIM-FM},
  year = {2024},
  url = {https://mlanthology.org/neuripsw/2024/rawat2024neuripsw-diversitymedqa/}
}