Pitfalls in Evaluating Inference-Time Methods for Improving LLM Reliability

Abstract

Though Large Language Models (LLMs) have demonstrated remarkable capabilities, they remain prone to outputting falsehoods in seemingly persuasive language. Many recent works attempt to address this problem by using LLMs in a framework where a single seed prompt triggers a series of interactions with an otherwise unchanged LLM via augmented prompts, and the results are aggregated with the goal of producing a more reliable output. We consider the replicability and generalizability of evaluations of inference-time methods intended to improve the reliability of responses from a base LLM. We survey how such methods have been evaluated in the literature and find a wide variety of benchmarks and models in use. Motivated by this, we conduct our own evaluation of the effectiveness of several methods across a range of benchmarks and models. Our evaluation reveals that while these techniques show promise in improving reliability, performance still varies substantially across domains and tasks, and methods that show substantial improvements on weaker base models often do not improve reliability for stronger base models.
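To make the class of methods the abstract describes concrete, below is a minimal sketch of one such inference-time framework: a single seed prompt is expanded into several augmented prompts, each sent to an unchanged base model, and the responses are aggregated by majority vote (in the style of self-consistency sampling). The function query_model and the prompt wording are hypothetical placeholders, not the paper's implementation.

from collections import Counter

def query_model(prompt: str) -> str:
    """Hypothetical call to an unchanged base LLM; replace with a real API call."""
    raise NotImplementedError

def aggregate_responses(seed_prompt: str, n_samples: int = 5) -> str:
    """Expand a single seed prompt into augmented prompts, query the base model
    once per prompt, and return the most common answer."""
    augmented = [
        f"{seed_prompt}\nLet's think step by step before answering. (sample {i})"
        for i in range(n_samples)
    ]
    answers = [query_model(p) for p in augmented]
    # Majority vote over the sampled answers; other methods aggregate differently
    # (e.g., debate, verification, or confidence-weighted selection).
    return Counter(answers).most_common(1)[0][0]

The base model itself is never modified; only the prompting and aggregation around it change, which is what makes these methods "inference-time".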

Cite

Text

Jerge and Evans. "Pitfalls in Evaluating Inference-Time Methods for Improving LLM Reliability." Transactions on Machine Learning Research, 2025.

Markdown

[Jerge and Evans. "Pitfalls in Evaluating Inference-Time Methods for Improving LLM Reliability." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/jerge2025tmlr-pitfalls/)

BibTeX

@article{jerge2025tmlr-pitfalls,
  title     = {{Pitfalls in Evaluating Inference-Time Methods for Improving LLM Reliability}},
  author    = {Jerge, Michael M. and Evans, David},
  journal   = {Transactions on Machine Learning Research},
  year      = {2025},
  url       = {https://mlanthology.org/tmlr/2025/jerge2025tmlr-pitfalls/}
}