SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models

Abstract

We introduce SealQA, a challenge benchmark for evaluating SEarch-Augmented Language models on fact-seeking questions where web search yields conflicting, noisy, or unhelpful results. SealQA comes in three flavors: (1) SEAL-0 (main) and (2) SEAL-HARD, both of which assess factual accuracy and reasoning capabilities, where SEAL-0 targets the most challenging questions that frontier non-reasoning models (e.g., GPT-4.1) answer with near-zero accuracy; and (3) LongSeal, which extends SealQA to test long-context, multi-document reasoning in "needle-in-a-haystack" settings. Our evaluation reveals critical limitations in current models. Even frontier reasoning models face significant challenges across SealQA flavors. On SEAL-0, GPT-5 with tools achieves only 43.2% accuracy at its best reasoning effort. We also find that even advanced reasoning models (e.g., DeepSeek-R1) can be vulnerable to noisy search results. Notably, increasing test-time compute does not yield reliable gains across GPT-5 and the o-series of models, with performance often plateauing or even declining early. Finally, while current models are less affected by the "lost-in-the-middle" issue, they still fail to reliably identify relevant documents in LongSeal when faced with numerous distractors. To facilitate future work, we release SealQA at https://huggingface.co/datasets/vtllms/sealqa.
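To make the "needle-in-a-haystack" setting concrete, the following is a minimal sketch of how one might probe position sensitivity: a single relevant document is placed at a chosen index among distractors, and accuracy is grouped by that index. This is not the authors' construction of LongSeal. The function names (`build_haystack`, `format_prompt`, `accuracy_by_position`) and the prompt layout are hypothetical and chosen for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_haystack(gold_doc: str, distractors: List[str], gold_position: int) -> List[str]:
    """Return a document list with the gold document inserted at gold_position.

    Hypothetical helper: the real benchmark may order documents differently.
    """
    if not 0 <= gold_position <= len(distractors):
        raise ValueError("gold_position must lie within [0, len(distractors)]")
    docs = list(distractors)  # copy so the caller's list is not mutated
    docs.insert(gold_position, gold_doc)
    return docs


def format_prompt(question: str, docs: List[str]) -> str:
    """Join numbered documents and the question into one prompt string (assumed layout)."""
    numbered = [f"[Document {i + 1}]\n{doc}" for i, doc in enumerate(docs)]
    return "\n\n".join(numbered) + f"\n\nQuestion: {question}\nAnswer:"


def accuracy_by_position(results: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Group (gold_position, is_correct) pairs and return accuracy per position."""
    totals: Dict[int, int] = defaultdict(int)
    correct: Dict[int, int] = defaultdict(int)
    for position, is_correct in results:
        totals[position] += 1
        correct[position] += int(is_correct)
    return {pos: correct[pos] / totals[pos] for pos in totals}
```

A flat accuracy curve across positions would be consistent with a model being less affected by "lost-in-the-middle", while low accuracy at every position would point to difficulty identifying the relevant document among distractors.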

Cite

Text

Pham et al. "SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models." International Conference on Learning Representations, 2026.

Markdown

[Pham et al. "SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/pham2026iclr-sealqa/)

BibTeX

@inproceedings{pham2026iclr-sealqa,
  title     = {{SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models}},
  author    = {Pham, Thinh and Nguyen, Nguyen Phan and Zunjare, Pratibha and Chen, Weiyuan and Tseng, Yu-Min and Vu, Tu},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/pham2026iclr-sealqa/}
}