SELF-[IN]CORRECT: LLMs Struggle with Discriminating Self-Generated Responses

Abstract

Can LLMs consistently improve their previous outputs for better results? For this to hold, LLMs would need to be better at discriminating among previously generated alternatives than at generating initial responses. We explore the validity of this hypothesis in practice. We first formulate a unified framework that allows us to compare the generative and discriminative capability of any model on any task. In our resulting experimental analysis of several open-source and industrial LLMs, we observe that models are not reliably better at discriminating among previously generated alternatives than at generating initial responses. This finding challenges the notion that LLMs may be able to enhance their performance solely through their own judgment.
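
As a rough illustration of the comparison the abstract describes, the sketch below contrasts a model's generative accuracy (how often a single sampled answer is correct) with its discriminative accuracy (how often the model picks the correct answer out of its own previously generated candidates). This is a minimal sketch of one plausible instantiation, not the authors' implementation: sample_answer and choose_among are hypothetical callables standing in for model API calls, answers are compared by exact string match for simplicity, and the paper's exact candidate construction may differ.

import random
from typing import Callable, List

def generative_accuracy(
    questions: List[str],
    answers: List[str],
    sample_answer: Callable[[str], str],  # hypothetical: one model generation per question
) -> float:
    """Fraction of questions answered correctly on a single generation."""
    correct = sum(sample_answer(q) == a for q, a in zip(questions, answers))
    return correct / len(questions)

def discriminative_accuracy(
    questions: List[str],
    answers: List[str],
    sample_answer: Callable[[str], str],
    choose_among: Callable[[str, List[str]], str],  # hypothetical: model selects one candidate
    n_candidates: int = 4,
) -> float:
    """Fraction of questions where the model selects the correct answer
    from a pool of its own sampled candidates."""
    solvable = 0
    correct = 0
    for q, a in zip(questions, answers):
        candidates = [sample_answer(q) for _ in range(n_candidates)]
        if a not in candidates:
            continue  # skip instances with no correct candidate, so selection is well-posed
        random.shuffle(candidates)  # avoid rewarding a fixed-position guessing strategy
        solvable += 1
        correct += choose_among(q, candidates) == a
    return correct / solvable if solvable else float("nan")

Under this framing, a model whose discriminative accuracy falls below its generative accuracy on the same task exhibits the gap that the paper's title refers to.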

Cite

Text

Jiang et al. "SELF-[IN]CORRECT: LLMs Struggle with Discriminating Self-Generated Responses." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I23.34603

Markdown

[Jiang et al. "SELF-[IN]CORRECT: LLMs Struggle with Discriminating Self-Generated Responses." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/jiang2025aaai-self/) doi:10.1609/AAAI.V39I23.34603

BibTeX

@inproceedings{jiang2025aaai-self,
  title     = {{SELF-[IN]CORRECT: LLMs Struggle with Discriminating Self-Generated Responses}},
  author    = {Jiang, Dongwei and Zhang, Jingyu and Weller, Orion and Weir, Nathaniel and Van Durme, Benjamin and Khashabi, Daniel},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {24266--24275},
  doi       = {10.1609/AAAI.V39I23.34603},
  url       = {https://mlanthology.org/aaai/2025/jiang2025aaai-self/}
}