OpenEstimate: Evaluating LLMs on Reasoning Under Uncertainty with Real-World Data
Abstract
Real-world settings where language models (LMs) are deployed, in domains spanning healthcare, finance, and other forms of knowledge work, require models to grapple with incomplete information and reason under uncertainty. Yet most LM evaluations focus on problems with well-defined answers and success criteria. This gap exists in part because natural problems involving uncertainty are difficult to construct: given that LMs have access to most of the same knowledge as humans, it is non-trivial to design questions for which LMs will struggle to produce correct answers. As a result, LM performance on reasoning under uncertainty remains poorly characterized. To address this gap, we introduce OpenEstimate, an extensible, multi-domain benchmark for evaluating LMs on probabilistic estimation tasks that require models to synthesize knowledge from pretraining and express predictions as Bayesian priors. We assess these priors for accuracy and calibration. Across six frontier models, we find that LM-elicited priors are worth the equivalent of about five samples from the underlying data distribution, and that posteriors computed using LM priors tend to be more accurate than those computed using a naive prior. At the same time, the relationship between model accuracy and confidence is weak across the board, indicating the value of developing new methods to improve calibration. The OpenEstimate benchmark thus offers a challenging evaluation for frontier LMs and a platform for developing models that are better at probabilistic estimation and reasoning under uncertainty.
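To make the idea of a prior being "worth" a number of samples concrete, the sketch below uses a Beta-Binomial conjugate update. It is an illustration only: the abstract does not say which prior families or which sample-equivalence measure the benchmark uses, so the `BetaPrior` class, the pseudo-count reading of prior strength, and all numbers here are hypothetical assumptions rather than the authors' method.

```python
from dataclasses import dataclass


@dataclass
class BetaPrior:
    """Beta(alpha, beta) belief about an unknown proportion (hypothetical helper)."""
    alpha: float
    beta: float

    @property
    def mean(self) -> float:
        # Expected value of a Beta(alpha, beta) distribution.
        return self.alpha / (self.alpha + self.beta)

    @property
    def pseudo_count(self) -> float:
        # alpha + beta acts like a number of previously seen observations;
        # this is one common reading of prior strength, assumed for illustration.
        return self.alpha + self.beta

    def update(self, successes: int, trials: int) -> "BetaPrior":
        # Conjugate update: add observed successes and failures to the parameters.
        return BetaPrior(self.alpha + successes, self.beta + trials - successes)


# Hypothetical prior elicited from an LM: mean 0.3 with a strength of 5 pseudo-observations.
lm_prior = BetaPrior(alpha=1.5, beta=3.5)
# Naive baseline: a uniform prior over the proportion.
naive_prior = BetaPrior(alpha=1.0, beta=1.0)

# Made-up data: 2 successes in 10 draws from the underlying distribution.
successes, trials = 2, 10
lm_post = lm_prior.update(successes, trials)
naive_post = naive_prior.update(successes, trials)

print(f"LM prior mean={lm_prior.mean:.3f}, posterior mean={lm_post.mean:.3f}")
print(f"Naive prior mean={naive_prior.mean:.3f}, posterior mean={naive_post.mean:.3f}")
```

Under this reading, a prior with a pseudo-count of 5 pulls the posterior about as strongly as five real observations would. Whether it helps or hurts depends on how close its mean is to the true value, which is why accuracy and calibration are assessed separately.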
Cite
Text
Marzoev et al. "OpenEstimate: Evaluating LLMs on Reasoning Under Uncertainty with Real-World Data." International Conference on Learning Representations, 2026.
Markdown
[Marzoev et al. "OpenEstimate: Evaluating LLMs on Reasoning Under Uncertainty with Real-World Data." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/marzoev2026iclr-openestimate/)
BibTeX
@inproceedings{marzoev2026iclr-openestimate,
title = {{OpenEstimate: Evaluating LLMs on Reasoning Under Uncertainty with Real-World Data}},
author = {Marzoev, Alana and Ross, Jillian and Andreas, Jacob},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/marzoev2026iclr-openestimate/}
}