Predicting the Performance of Black-Box Language Models with Follow-up Queries

Abstract

Reliably predicting the behavior of language models, such as whether their outputs are correct or have been adversarially manipulated, is a fundamentally challenging task. It is harder still because frontier language models are often offered only through closed-source APIs that provide black-box access. In this paper, we predict the behavior of black-box language models by asking follow-up questions and using the probabilities of the responses _as_ representations on which to train reliable predictors. We first demonstrate that a linear model trained on these responses reliably and accurately predicts model correctness on question-answering and reasoning benchmarks. Surprisingly, this can _even outperform white-box linear predictors_ that operate over model internals or activations. Furthermore, we demonstrate that these follow-up question responses can reliably distinguish between a clean version of an LLM and one that has been adversarially influenced via a system prompt to answer questions incorrectly or to introduce bugs into generated code. Finally, we show that they can also be used to differentiate between black-box LLMs, enabling the detection of misrepresented models provided through an API. Overall, our work shows promise for monitoring black-box language model behavior, supporting their deployment in larger, autonomous systems.
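The core idea (follow-up response probabilities used as features for a linear predictor of correctness) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration and not taken from the paper: the follow-up prompts in `FOLLOW_UPS`, the `yes_probability` callable standing in for a real API call, and the simulated model, which simply assumes that "yes" is more likely after a correct answer. The linear model is a plain logistic regression fit by gradient descent.

```python
import math
import random

# Hypothetical follow-up prompts; the paper's actual prompts are not listed here.
FOLLOW_UPS = [
    "Are you confident in your answer?",
    "Would an expert agree with your answer?",
    "Can you justify your answer step by step?",
]

def featurize(yes_probability, question, answer):
    """One feature per follow-up: the model's probability of replying 'yes'."""
    return [yes_probability(question, answer, f) for f in FOLLOW_UPS]

def simulated_yes_probability(is_correct, rng):
    """Stand-in for a black-box API call (assumption: 'yes' is likelier when correct)."""
    center = 0.7 if is_correct else 0.3
    def yes_probability(question, answer, follow_up):
        return min(1.0, max(0.0, rng.gauss(center, 0.1)))
    return yes_probability

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """Predicted probability that the original answer was correct."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def train_linear_predictor(X, y, lr=1.0, epochs=500):
    """Logistic regression fit with full-batch gradient descent."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, label in zip(X, y):
            err = predict_proba(w, b, x) - label
            grad_b += err
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def build_dataset(n, rng):
    """Synthetic (features, correctness label) pairs from the simulated model."""
    X, y = [], []
    for _ in range(n):
        is_correct = rng.random() < 0.5
        model = simulated_yes_probability(is_correct, rng)
        X.append(featurize(model, "placeholder question", "placeholder answer"))
        y.append(1 if is_correct else 0)
    return X, y

def accuracy(w, b, X, y):
    hits = sum((predict_proba(w, b, x) >= 0.5) == bool(label) for x, label in zip(X, y))
    return hits / len(y)

rng = random.Random(0)
X_train, y_train = build_dataset(200, rng)
X_test, y_test = build_dataset(100, rng)
w, b = train_linear_predictor(X_train, y_train)
test_accuracy = accuracy(w, b, X_test, y_test)
print(f"held-out accuracy on simulated data: {test_accuracy:.2f}")
```

With a real API, `yes_probability` would instead read the probability of a "yes" token from the model's returned token probabilities, where the provider exposes them. The accuracy printed here reflects only the simulated data and says nothing about the paper's results.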

Cite

Text

Sam et al. "Predicting the Performance of Black-Box Language Models with Follow-up Queries." Advances in Neural Information Processing Systems, 2025.

Markdown

[Sam et al. "Predicting the Performance of Black-Box Language Models with Follow-up Queries." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/sam2025neurips-predicting/)

BibTeX

@inproceedings{sam2025neurips-predicting,
  title     = {{Predicting the Performance of Black-Box Language Models with Follow-up Queries}},
  author    = {Sam, Dylan and Finzi, Marc Anton and Kolter, J Zico},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/sam2025neurips-predicting/}
}