LFQA-E: Carefully Benchmarking Long-Form QA Evaluation

Fan, Yuchen; Ling, Chen; Zhong, Xin; Zhang, Shuo; Zhou, Heng; Zhang, Yuchen; Liang, Mingyu; Xie, Chengxing; Hua, Ermo; He, Zhizhou; Huang, Cheng; Chen, Gang; Ding, Ning; Zhou, Bowen

ICLR 2026

Abstract

Long-Form Question Answering (LFQA) involves generating comprehensive, paragraph-level responses to open-ended questions, which poses a significant challenge for evaluation due to the richness of information and flexible response format. Existing LFQA-evaluation benchmarks often lack reference answers and are limited in size and topic coverage, reducing their reliability. To address this gap, we introduce LFQA-E, a well-constructed, multilingual, and reference-based benchmark designed to rigorously evaluate automatic metrics for LFQA. LFQA-E comprises 1,625 questions and 7,649 pairwise comparisons across 15 topics, drawn from diverse sources such as online queries and examination questions, thereby enabling a comprehensive assessment of evaluation metrics. We examine five categories of metrics, encompassing 17 specific methods, using LFQA-E. The results demonstrate that none of the existing automatic metrics perform comparably to human judgments, highlighting their inability to capture the dense information in long-form responses. Furthermore, we present a detailed analysis of the failure cases and the generalization capacity of these metrics, offering insights to guide the future development of LFQA evaluation methods.
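To make the idea of testing a metric on pairwise comparisons concrete, here is a minimal sketch of how a reference-based metric could be scored against human preferences. Everything in it is an assumption for illustration: the record layout, the toy `token_f1` metric, the `pairwise_agreement` helper, and the rule that ties count as disagreement are hypothetical and are not taken from the paper or the LFQA-E release.

```python
from typing import Callable, List, Tuple

# Hypothetical record layout (an assumption, not the LFQA-E schema):
# (reference answer, response A, response B, human-preferred label "A" or "B")
Comparison = Tuple[str, str, str, str]


def token_f1(reference: str, response: str) -> float:
    """Toy reference-based metric: F1 over sets of lowercased whitespace tokens."""
    ref = set(reference.lower().split())
    hyp = set(response.lower().split())
    overlap = len(ref & hyp)
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def pairwise_agreement(
    comparisons: List[Comparison],
    metric: Callable[[str, str], float],
) -> float:
    """Fraction of comparisons where the metric prefers the human-preferred response.

    Ties in metric score are counted as disagreement (an illustrative choice).
    """
    if not comparisons:
        raise ValueError("comparisons must be non-empty")
    correct = 0
    for reference, resp_a, resp_b, human in comparisons:
        score_a = metric(reference, resp_a)
        score_b = metric(reference, resp_b)
        if score_a == score_b:
            continue  # tie: no credit
        predicted = "A" if score_a > score_b else "B"
        if predicted == human:
            correct += 1
    return correct / len(comparisons)


# Made-up examples purely for demonstration.
examples: List[Comparison] = [
    ("the mitochondria produces energy for the cell",
     "mitochondria produces energy", "the nucleus stores dna", "A"),
    ("water boils at 100 degrees celsius at sea level",
     "water freezes at zero", "water boils at 100 degrees celsius", "B"),
    ("paris is the capital of france",
     "paris is the capital of france", "paris is a city", "B"),
]

agreement = pairwise_agreement(examples, token_f1)
print(f"pairwise agreement: {agreement:.3f}")
```

In the third made-up example the human label deliberately contradicts lexical overlap, so the toy metric agrees on two of the three pairs. Any scoring function with the same `(reference, response)` signature can be passed in place of `token_f1`.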

Cite

Text

Fan et al. "LFQA-E: Carefully Benchmarking Long-Form QA Evaluation." International Conference on Learning Representations, 2026.

Markdown

[Fan et al. "LFQA-E: Carefully Benchmarking Long-Form QA Evaluation." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/fan2026iclr-lfqae/)

BibTeX

@inproceedings{fan2026iclr-lfqae,
  title     = {{LFQA-E: Carefully Benchmarking Long-Form QA Evaluation}},
  author    = {Fan, Yuchen and Ling, Chen and Zhong, Xin and Zhang, Shuo and Zhou, Heng and Zhang, Yuchen and Liang, Mingyu and Xie, Chengxing and Hua, Ermo and He, Zhizhou and Huang, Cheng and Chen, Gang and Ding, Ning and Zhou, Bowen},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/fan2026iclr-lfqae/}
}