DetectiveQA: Evaluating Long-Context Reasoning on Detective Novels

Abstract

Recently, significant efforts have been devoted to enhancing the long-context capabilities of Large Language Models (LLMs), particularly in long-context reasoning. To facilitate this research, we propose **DetectiveQA**, a dataset specifically designed for narrative reasoning within long contexts. We leverage detective novels, averaging over 100K tokens, to create a dataset of 1,200 human-annotated questions in both Chinese and English, each paired with reference reasoning steps. Furthermore, we introduce a step-wise reasoning metric that enables finer-grained evaluation of LLMs' reasoning processes. We validate our approach and evaluate mainstream LLMs, including GPT-4, Claude, and LLaMA, revealing persistent long-context reasoning challenges as well as evidence-retrieval difficulties and data-contamination issues. Our findings offer valuable insights into the study of long-context reasoning and lay the groundwork for more rigorous evaluations.

Cite

Text

Xu et al. "DetectiveQA: Evaluating Long-Context Reasoning on Detective Novels." ICLR 2025 Workshops: LLM_Reason_and_Plan, 2025.

Markdown

[Xu et al. "DetectiveQA: Evaluating Long-Context Reasoning on Detective Novels." ICLR 2025 Workshops: LLM_Reason_and_Plan, 2025.](https://mlanthology.org/iclrw/2025/xu2025iclrw-detectiveqa/)

BibTeX

@inproceedings{xu2025iclrw-detectiveqa,
  title     = {{DetectiveQA: Evaluating Long-Context Reasoning on Detective Novels}},
  author    = {Xu, Zhe and Ye, Jiasheng and Liu, Xiaoran and Liu, Xiangyang and Sun, Tianxiang and Liu, Zhigeng and Guo, Qipeng and Li, Linlin and Liu, Qun and Huang, Xuanjing and Qiu, Xipeng},
  booktitle = {ICLR 2025 Workshops: LLM_Reason_and_Plan},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/xu2025iclrw-detectiveqa/}
}