FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning

Hu, Liang; Jiao, Jianpeng; Liu, Jiashuo; Mutu, Dongyuan; Ren, Yanle; Wen, Zhoufutu; Zhang, Kaiyuan; Zhang, Xuanliang; Gao, Xiang; He, Tianci; Hu, Fei; Liao, Yali; Wang, Zaiyuan; Liu, Jingkai; Daibin, Sun; Zeng, Ziqing; Zeng, Zhiyuan; Yang, Chenghao; Yang, Qianyu; Yin, Mingren; Zhang, Ge; Zhang, Xinyi; Zhao, Xiying; Zhenwei, Zhu; Namkoong, Hongseok; Huang, Wenhao

ICLR 2026

Abstract

Search has emerged as core infrastructure for LLM-based agents and is widely viewed as critical on the path toward more general intelligence. Finance is a particularly demanding proving ground: analysts routinely conduct complex, multi-step searches over time-sensitive, domain-specific data, which makes the domain ideal for assessing both search proficiency and knowledge-grounded reasoning. Yet no existing open financial dataset evaluates the data-search capability of end-to-end agents, largely because constructing realistic, complex tasks requires deep financial expertise and because time-sensitive data is hard to evaluate. We present FinSearchComp, the first fully open-source agent benchmark for realistic, open-domain financial search and reasoning. FinSearchComp comprises three tasks (Time-Sensitive Data Fetching, Simple Historical Lookup, and Complex Historical Investigation) that closely reproduce real-world financial analyst workflows. To ensure difficulty and reliability, we engage 70 professional financial experts for annotation and implement a rigorous multi-stage quality-assurance pipeline. The benchmark includes 635 questions spanning global and Greater China markets, and we evaluate 21 models (products) on it. Grok 4 (web) tops the global subset, approaching expert-level accuracy. DouBao (web) leads on the Greater China subset. Experimental analyses show that equipping agents with web search and financial plugins substantially improves results on FinSearchComp, and that the country of origin of models and tools significantly affects performance. By aligning with realistic analyst tasks and providing end-to-end evaluation, FinSearchComp offers a professional, high-difficulty testbed for complex financial search and reasoning.

Cite

Text

Hu et al. "FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning." International Conference on Learning Representations, 2026.

Markdown

[Hu et al. "FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/hu2026iclr-finsearchcomp/)

BibTeX

@inproceedings{hu2026iclr-finsearchcomp,
  title     = {{FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning}},
  author    = {Hu, Liang and Jiao, Jianpeng and Liu, Jiashuo and Mutu, Dongyuan and Ren, Yanle and Wen, Zhoufutu and Zhang, Kaiyuan and Zhang, Xuanliang and Gao, Xiang and He, Tianci and Hu, Fei and Liao, Yali and Wang, Zaiyuan and Liu, Jingkai and Daibin, Sun and Zeng, Ziqing and Zeng, Zhiyuan and Yang, Chenghao and Yang, Qianyu and Yin, Mingren and Zhang, Ge and Zhang, Xinyi and Zhao, Xiying and Zhenwei, Zhu and Namkoong, Hongseok and Huang, Wenhao},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/hu2026iclr-finsearchcomp/}
}