MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records

Fleming, Scott L.; Lozano, Alejandro; Haberkorn, William J.; Jindal, Jenelle A.; Reis, Eduardo Pontes; Thapa, Rahul; Blankemeier, Louis; Genkins, Julian Z.; Steinberg, Ethan; Nayak, Ashwin; Patel, Birju S.; Chiang, Chia-Chun; Callahan, Alison; Huo, Zepeng; Gatidis, Sergios; Adams, Scott J.; Fayanju, Oluseyi; Shah, Shreya J.; Savage, Thomas; Goh, Ethan; Chaudhari, Akshay S.; Aghaeepour, Nima; Sharp, Christopher D.; Pfeffer, Michael A.; Liang, Percy; Chen, Jonathan H.; Morse, Keith E.; Brunskill, Emma P.; Fries, Jason A.; Shah, Nigam H.

doi:10.1609/AAAI.V38I20.30205

AAAI 2024 pp. 22021-22030

Abstract

The ability of large language models (LLMs) to follow natural language instructions with human-level fluency suggests many opportunities in healthcare to reduce administrative burden and improve quality of care. However, evaluating LLMs on realistic text generation tasks for healthcare remains challenging. Existing question answering datasets for electronic health record (EHR) data fail to capture the complexity of information needs and documentation burdens experienced by clinicians. To address these challenges, we introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data. MedAlign is curated by 15 clinicians (7 specialities), includes clinician-written reference responses for 303 instructions, and provides 276 longitudinal EHRs for grounding instruction-response pairs. We used MedAlign to evaluate 6 general-domain LLMs, having clinicians rank the accuracy and quality of each LLM response. We found high error rates, ranging from 35% (GPT-4) to 68% (MPT-7B-Instruct), and an 8.3% drop in accuracy when moving from 32k to 2k context lengths for GPT-4. Finally, we report correlations between clinician rankings and automated natural language generation metrics as a way to rank LLMs without human review. We make MedAlign available under a research data use agreement to enable LLM evaluations on tasks aligned with clinician needs and preferences.

Cite

Text

Fleming et al. "MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records." AAAI Conference on Artificial Intelligence, 2024. doi:10.1609/AAAI.V38I20.30205

Markdown

[Fleming et al. "MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records." AAAI Conference on Artificial Intelligence, 2024.](https://mlanthology.org/aaai/2024/fleming2024aaai-medalign/) doi:10.1609/AAAI.V38I20.30205

BibTeX

@inproceedings{fleming2024aaai-medalign,
  title     = {{MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records}},
  author    = {Fleming, Scott L. and Lozano, Alejandro and Haberkorn, William J. and Jindal, Jenelle A. and Reis, Eduardo Pontes and Thapa, Rahul and Blankemeier, Louis and Genkins, Julian Z. and Steinberg, Ethan and Nayak, Ashwin and Patel, Birju S. and Chiang, Chia-Chun and Callahan, Alison and Huo, Zepeng and Gatidis, Sergios and Adams, Scott J. and Fayanju, Oluseyi and Shah, Shreya J. and Savage, Thomas and Goh, Ethan and Chaudhari, Akshay S. and Aghaeepour, Nima and Sharp, Christopher D. and Pfeffer, Michael A. and Liang, Percy and Chen, Jonathan H. and Morse, Keith E. and Brunskill, Emma P. and Fries, Jason A. and Shah, Nigam H.},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2024},
  pages     = {22021--22030},
  doi       = {10.1609/AAAI.V38I20.30205},
  url       = {https://mlanthology.org/aaai/2024/fleming2024aaai-medalign/}
}