LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs

Abstract

Current benchmarks such as ``$\textit{Needle-in-a-Haystack}$'' ($\textit{NIAH}$), $\textit{Ruler}$, and $\textit{NeedleBench}$ focus on models' ability to understand long-context input sequences but fail to capture a critical dimension: the generation of high-quality long-form text. Applications such as design proposals, technical documentation, and creative writing rely on coherent, instruction-following outputs over extended sequences, a challenge that existing benchmarks do not adequately address. To fill this gap, we introduce $\textit{LongGenBench}$, a novel benchmark designed to rigorously evaluate the ability of large language models (LLMs) to generate long text while adhering to complex instructions. Through tasks that require specific events or constraints to appear within the generated text, $\textit{LongGenBench}$ evaluates model performance across four distinct scenarios, three instruction types, and two generation lengths (16K and 32K tokens). Our evaluation of ten state-of-the-art LLMs reveals that, despite strong results on $\textit{Ruler}$, all models struggle with long text generation on $\textit{LongGenBench}$, particularly as the text length increases. This suggests that current LLMs are not yet equipped to meet the demands of real-world, long-form text generation. We open-source $\textit{LongGenBench}$ to promote comprehensive evaluation and improvement in this critical area, with code and data available at ${anonymousurl}$.
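To make the evaluation idea concrete, the sketch below shows one simple way such instruction-following checks could be scored: count how many of the required events or constraints actually appear in a long generated output. This is a minimal illustrative sketch, not the paper's official evaluation code; the function and variable names (`completion_rate`, `required_events`) are assumptions introduced here for illustration.

```python
# Illustrative sketch only: score a long generation by the fraction of
# required events/constraints it contains. This is NOT the LongGenBench
# evaluation code; names and the matching rule are assumptions.

def completion_rate(generated_text: str, required_events: list[str]) -> float:
    """Return the fraction of required events that appear in the generated text."""
    if not required_events:
        return 1.0
    hits = sum(1 for event in required_events if event in generated_text)
    return hits / len(required_events)


if __name__ == "__main__":
    generation = (
        "Week 1: orientation session. "
        "Week 7: midterm exam covering chapters 1-5. "
        "Week 14: final project presentations."
    )
    events = ["midterm exam", "final project", "guest lecture"]
    print(f"Completion rate: {completion_rate(generation, events):.2f}")  # 0.67
```

In practice, a benchmark like this would likely use more robust matching (e.g., normalization or model-based judging) rather than exact substring checks, but the scoring principle, verifying that instructed events appear at the right points in a very long output, is the same.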

Cite

Text

Wu et al. "LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs." International Conference on Learning Representations, 2025.

Markdown

[Wu et al. "LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/wu2025iclr-longgenbench/)

BibTeX

@inproceedings{wu2025iclr-longgenbench,
  title     = {{LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs}},
  author    = {Wu, Yuhao and Hee, Ming Shan and Hu, Zhiqiang and Lee, Roy Ka-Wei},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/wu2025iclr-longgenbench/}
}