ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists

Ruan, Jie; Nair, Inderjeet Jayakumar; Cao, Shuyang; Liu, Amy; Munir, Sheza; Pollens-Dempsey, Micah; Chiang, Yune-Ting Tiffany; Kates, Lucy R.; David, Nicholas; Chen, Sihan; Yang, Ruxin; Yang, Yuqian; Gump, Jihyun Jasmine; Bialek, Tessa; Sankaran, Vivek S; Schlanger, Margo; Wang, Lu

ICLR 2026

Abstract

This paper introduces ExpertLongBench, an expert-level benchmark containing 11 tasks from 9 domains that reflect realistic expert workflows and applications. Beyond question answering, the application-driven tasks in ExpertLongBench demand long-form outputs that can exceed 5,000 tokens and strict adherence to domain-specific requirements. Notably, each task in ExpertLongBench includes a rubric, designed or validated by domain experts, that specifies task requirements and guides output evaluation. Furthermore, we propose CLEAR, an evaluation framework that supports accurate evaluation of long-form model outputs in our benchmark. To achieve fine-grained, expert-aligned evaluation, CLEAR derives checklists from both model outputs and references by extracting the information that corresponds to each item in the task-specific rubric. Checklist items from model outputs are then compared with the corresponding items from reference outputs to assess their correctness, enabling grounded evaluation. We benchmark 15 popular large language models (LLMs) and analyze the components of CLEAR, showing that (1) existing LLMs require significant improvement on expert-level tasks, with the top performer, Gemini-2.5-Pro, achieving an F1 score of only 33.4; (2) models can generate content corresponding to the required aspects, but that content is far from correct; and (3) accurate checklist extraction and comparison in CLEAR can be achieved with open-weight models, enabling more scalable, reproducible, and low-cost use.
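The checklist comparison described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the F1 score is computed, so the definitions below are assumptions (precision is taken over rubric items the model output covers, recall over items the reference covers). The names `checklist_f1` and `exact_match` and the sample legal-style rubric items are hypothetical, and the string comparison stands in for the comparison step that CLEAR performs with a language model.

```python
from typing import Callable, Dict, Optional

# Rubric item name -> extracted text, or None when nothing was extracted.
Checklist = Dict[str, Optional[str]]


def checklist_f1(
    model: Checklist,
    reference: Checklist,
    matches: Callable[[str, str], bool],
) -> Dict[str, float]:
    """Score a model checklist against a reference checklist (illustrative only)."""
    # Rubric items for which something was extracted on each side.
    model_items = {k for k, v in model.items() if v}
    ref_items = {k for k, v in reference.items() if v}

    # An item is correct only if both sides cover it and the judge says they agree.
    correct = sum(
        1 for k in model_items & ref_items if matches(model[k], reference[k])
    )

    # Assumed definitions: precision over covered model items,
    # recall over covered reference items.
    precision = correct / len(model_items) if model_items else 0.0
    recall = correct / len(ref_items) if ref_items else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def exact_match(a: str, b: str) -> bool:
    # Stand-in for a semantic, model-based comparison of two checklist entries.
    return a.strip().lower() == b.strip().lower()


# Hypothetical rubric with three items; the model covers two and gets one right.
reference = {"parties": "Smith v. Jones", "court": "E.D. Mich.", "remedy": "injunction"}
model = {"parties": "Smith v. Jones", "court": "S.D.N.Y.", "remedy": None}

scores = checklist_f1(model, reference, exact_match)
print(scores)
```

In this toy case the model output addresses two of the three rubric items but only one entry agrees with the reference, which mirrors finding (2): covering the required aspects is not the same as getting them right.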

Cite

Text

Ruan et al. "ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists." International Conference on Learning Representations, 2026.

Markdown

[Ruan et al. "ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/ruan2026iclr-expertlongbench/)

BibTeX

@inproceedings{ruan2026iclr-expertlongbench,
  title     = {{ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists}},
  author    = {Ruan, Jie and Nair, Inderjeet Jayakumar and Cao, Shuyang and Liu, Amy and Munir, Sheza and Pollens-Dempsey, Micah and Chiang, Yune-Ting Tiffany and Kates, Lucy R. and David, Nicholas and Chen, Sihan and Yang, Ruxin and Yang, Yuqian and Gump, Jihyun Jasmine and Bialek, Tessa and Sankaran, Vivek S and Schlanger, Margo and Wang, Lu},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/ruan2026iclr-expertlongbench/}
}