A Theory for Worst-Case vs. Average-Case Guarantees for LLMs

Abstract

How can we trust the correctness of a learned model on a particular input of interest? Model accuracy is typically measured *on average* over a distribution of inputs, giving no guarantee for any fixed input. This paper proposes a theoretically founded solution to this problem: to train *Self-Proving models* that prove the correctness of their output to a verification algorithm $V$ via an Interactive Proof. Self-Proving models guarantee that, with high probability over an input sampled from a given distribution, the model generates a correct output *and* successfully proves its correctness to $V$. The *soundness* property of $V$ guarantees that, for *every* input, no model can convince $V$ of the correctness of an incorrect output. Thus, a Self-Proving model proves the correctness of most of its outputs, while *all* incorrect outputs (of any model) are detected by $V$. We devise and analyze two generic methods for learning Self-Proving models: *Transcript Learning (TL)*, which relies on access to transcripts of accepting interactions, and *Reinforcement Learning from Verifier Feedback (RLVF)*, which trains a model by emulating interactions with the verifier.
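To make the two guarantees concrete, the following is a minimal LaTeX sketch consistent with the abstract. The input distribution $\mu$, the completeness parameter $\alpha$, the soundness error $\epsilon$, and a single correct output $y^*(x)$ per input are notational assumptions introduced here for illustration; they are not fixed by the text above.

% Sketch of the two guarantees described in the abstract.
% Assumed notation (not from the abstract): input distribution \mu,
% completeness parameter \alpha, soundness error \epsilon, and, for
% simplicity, a unique correct output y^*(x) for each input x.

% Self-Proving (average-case over inputs): the model P outputs a correct
% answer and convinces the verifier V in the interaction (V <-> P)(x, y).
\Pr_{x \sim \mu,\; y \sim P(x)}
  \bigl[\, y = y^*(x) \;\wedge\; (V \leftrightarrow P)(x, y) = \mathsf{accept} \,\bigr] \;\ge\; \alpha

% Soundness (worst-case over inputs and provers): no prover \tilde{P} can
% convince V to accept an incorrect output, except with small probability.
\forall x,\; \forall \tilde{P},\; \forall y \neq y^*(x):\quad
  \Pr\bigl[\, (V \leftrightarrow \tilde{P})(x, y) = \mathsf{accept} \,\bigr] \;\le\; \epsilon

The first condition is the average-case guarantee learned by TL or RLVF; the second is the worst-case guarantee supplied by the verifier alone, independent of how the model was trained.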

Cite

Text

Amit et al. "A Theory for Worst-Case vs. Average-Case Guarantees for LLMs." Advances in Neural Information Processing Systems, 2025.

Markdown

[Amit et al. "A Theory for Worst-Case vs. Average-Case Guarantees for LLMs." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/amit2025neurips-theory/)

BibTeX

@inproceedings{amit2025neurips-theory,
  title     = {{A Theory for Worst-Case vs. Average-Case Guarantees for LLMs}},
  author    = {Amit, Noga and Goldwasser, Shafi and Paradise, Orr and Rothblum, Guy N.},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/amit2025neurips-theory/}
}