Consistency Checks for Language Model Forecasters

Abstract

Forecasting is a task that is difficult to evaluate: the ground truth can only be known in the future. Recent work showing LLM forecasters rapidly approaching human-level performance raises the question: how can we benchmark and evaluate these forecasters *instantaneously*? Following the consistency check framework, we measure the performance of forecasters in terms of the consistency of their predictions on different logically related questions. We propose a new, general consistency metric based on *arbitrage*: for example, if a forecasting AI illogically predicts that both the Democratic and Republican parties have a 60% probability of winning the 2024 US presidential election, an arbitrageur could trade against the forecaster's predictions and make a profit. We build an automated evaluation system that generates a set of base questions, instantiates consistency checks from these questions, elicits the predictions of the forecaster, and measures the consistency of the predictions. We then build a standard, proper-scoring-rule forecasting benchmark, and show that our (instantaneous) consistency metrics correlate strongly with LLM forecasters' ground-truth Brier scores (which are only known in the future). We also release a consistency benchmark that resolves in 2028, providing a long-term evaluation tool for forecasting.
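The arbitrage intuition from the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's actual metric (which is more general): if a forecaster's stated probabilities for a mutually exclusive, exhaustive set of outcomes sum to S ≠ 1, an arbitrageur trading unit contracts at those prices locks in |S − 1|.

```python
def arbitrage_profit(probs):
    """Guaranteed per-unit profit against inconsistent probabilities.

    `probs` are a forecaster's stated probabilities for a mutually
    exclusive, exhaustive set of outcomes; each unit contract pays 1
    if its outcome occurs. If S = sum(probs) > 1, sell one contract
    per outcome (collect S, pay out 1); if S < 1, buy one of each
    (pay S, receive 1). Either way the profit is |S - 1|.
    """
    return abs(sum(probs) - 1.0)

# The abstract's example: 60% on each party winning the 2024 election.
print(round(arbitrage_profit([0.6, 0.6]), 2))  # 0.2 guaranteed profit per unit
```

A consistent forecaster (probabilities summing to 1) yields zero arbitrage profit, which is why this quantity can serve as an instantaneous consistency score.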

Cite

Text

Paleka et al. "Consistency Checks for Language Model Forecasters." International Conference on Learning Representations, 2025.

Markdown

[Paleka et al. "Consistency Checks for Language Model Forecasters." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/paleka2025iclr-consistency/)

BibTeX

@inproceedings{paleka2025iclr-consistency,
  title     = {{Consistency Checks for Language Model Forecasters}},
  author    = {Paleka, Daniel and Sudhir, Abhimanyu Pallavi and Alvarez, Alejandro and Bhat, Vineeth and Shen, Adam and Wang, Evan and Tramèr, Florian},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/paleka2025iclr-consistency/}
}