SpineBench: A Clinically Salient, Level-Aware Benchmark Powered by the SpineMed-450k Corpus

Zhao, Ming; Dong, Wenhui; Zhang, Yang; XiangZheng,; Zhang, Zhonghao; Zhou, Zian; Guan, Yunzhi; Xu, Liukun; Peng, Wei; Gong, Zhaoyang; Zhang, Zhicheng; Li, Dachuan; Ma, Xiaosheng; Ma, Yuli; Ni, Jianing; Jiang, Changjiang; Tian, Lixia; Qixin, Chen; Kaishun, Xia; Liu, Pingping; Zhang, Tongshun; ZhiqiangLiu,; Bi, Zhongan; Si, Chenyang; Sun, Tiansheng; Shan, Caifeng

ICLR 2026

Abstract

Spine disorders affect 619 million people globally and are a leading cause of disability, yet AI-assisted diagnosis remains limited by the lack of level-aware, multimodal datasets. Clinical decision-making for spine disorders requires sophisticated reasoning across X-ray, CT, and MRI at specific vertebral levels. However, progress has been constrained by the absence of traceable, clinically grounded instruction data and standardized, spine-specific benchmarks. To address this, we introduce SpineMed, an ecosystem co-designed with practicing spine surgeons. It features SpineMed-450k, the first large-scale dataset explicitly designed for vertebral-level reasoning across imaging modalities, with over 450,000 instruction instances, and SpineBench, a clinically grounded evaluation framework. SpineMed-450k is curated from diverse sources, including textbooks, guidelines, open datasets, and approximately 1,000 de-identified hospital cases, using a clinician-in-the-loop pipeline with a two-stage LLM generation method (draft and revision) to ensure high-quality, traceable data for question-answering, multi-turn consultations, and report generation. SpineBench evaluates models on clinically salient axes, including level identification, pathology assessment, and surgical planning. Our comprehensive evaluation of several recent, advanced large vision-language models (LVLMs) on SpineBench reveals systematic weaknesses in fine-grained, level-specific reasoning. In contrast, our model fine-tuned on SpineMed-450k demonstrates consistent and significant improvements across all tasks. Clinician assessments confirm the diagnostic clarity and practical utility of our model's outputs.

Cite

Text

Zhao et al. "SpineBench: A Clinically Salient, Level-Aware Benchmark Powered by the SpineMed-450k Corpus." International Conference on Learning Representations, 2026.

Markdown

[Zhao et al. "SpineBench: A Clinically Salient, Level-Aware Benchmark Powered by the SpineMed-450k Corpus." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/zhao2026iclr-spinebench/)

BibTeX

@inproceedings{zhao2026iclr-spinebench,
  title     = {{SpineBench: A Clinically Salient, Level-Aware Benchmark Powered by the SpineMed-450k Corpus}},
  author    = {Zhao, Ming and Dong, Wenhui and Zhang, Yang and XiangZheng,  and Zhang, Zhonghao and Zhou, Zian and Guan, Yunzhi and Xu, Liukun and Peng, Wei and Gong, Zhaoyang and Zhang, Zhicheng and Li, Dachuan and Ma, Xiaosheng and Ma, Yuli and Ni, Jianing and Jiang, Changjiang and Tian, Lixia and Qixin, Chen and Kaishun, Xia and Liu, Pingping and Zhang, Tongshun and ZhiqiangLiu,  and Bi, Zhongan and Si, Chenyang and Sun, Tiansheng and Shan, Caifeng},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/zhao2026iclr-spinebench/}
}