PiCO: Peer Review in LLMs Based on Consistency Optimization

Abstract

Existing evaluation methods for large language models (LLMs) typically test performance on closed-environment, domain-specific benchmarks with human annotations. In this paper, we explore a novel unsupervised evaluation direction that uses a peer-review mechanism to measure LLMs automatically, without any human feedback. In this setting, both open-source and closed-source LLMs share the same environment: they answer unlabeled questions and evaluate each other, so each LLM's response score is jointly determined by the other, anonymous models. During this process, we found that answers more highly recognized by the other "reviewers" (models) usually come from LLMs with stronger abilities, and that these stronger models also evaluate others' answers more accurately. We formalize this as a consistency assumption: a model's ability and the score it receives are usually consistent. We exploit this assumption to optimize each model's confidence weight, thereby re-ranking the LLMs to be closer to human rankings. We perform experiments on multiple datasets with standard rank-based metrics, validating the effectiveness of the proposed approach.
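Illustration

The sketch below is a rough, hypothetical rendering of the consistency-optimization idea described in the abstract, not the authors' implementation. It assumes a score matrix S (S[i, j] is the average score reviewer model i assigns to model j's answers on a shared pool of unlabeled questions) and learnable confidence weights w, and it uses Pearson correlation between w and the weighted peer-review scores G = S^T w as a stand-in consistency objective; the matrix, objective, and function names are illustrative assumptions.

# Hedged sketch of peer-review consistency optimization (illustrative, not the paper's code).
# S[i, j]: average score "reviewer" model i gives to model j's answers.
# w[i]:    learnable confidence weight of reviewer i.
# G[j] = sum_i w[i] * S[i, j] is model j's peer-review score; we adjust w so that
# w and G agree (the consistency assumption), here by maximizing their correlation.
import numpy as np

def consistency_rank(S, lr=0.05, steps=500):
    m = S.shape[0]                      # number of LLMs (each is both reviewer and candidate)
    w = np.full(m, 1.0 / m)             # start from uniform confidence weights

    def corr(a, b):
        a = a - a.mean()
        b = b - b.mean()
        return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    eps = 1e-4
    for _ in range(steps):
        base = corr(w, S.T @ w)
        grad = np.zeros(m)
        for i in range(m):              # numerical gradient of the consistency objective
            w_eps = w.copy()
            w_eps[i] += eps
            grad[i] = (corr(w_eps, S.T @ w_eps) - base) / eps
        w = np.clip(w + lr * grad, 1e-6, None)
        w = w / w.sum()                 # keep confidence weights on the simplex

    G = S.T @ w
    return np.argsort(-G), w            # ranking (best first) and learned confidences

# Toy example: stronger models both receive higher scores and judge more sharply.
S = np.array([[0.9, 0.7, 0.5, 0.3],
              [0.8, 0.6, 0.5, 0.4],
              [0.7, 0.7, 0.6, 0.5],
              [0.6, 0.6, 0.6, 0.6]])
ranking, weights = consistency_rank(S)
print(ranking, weights)

The resulting ranking would then be compared against human rankings with standard rank-based metrics, as the abstract describes; the correlation objective here is only one plausible way to encode the consistency assumption.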

Cite

Text

Ning et al. "PiCO: Peer Review in LLMs Based on Consistency Optimization." International Conference on Learning Representations, 2025.

Markdown

[Ning et al. "PiCO: Peer Review in LLMs Based on Consistency Optimization." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/ning2025iclr-pico/)

BibTeX

@inproceedings{ning2025iclr-pico,
  title     = {{PiCO: Peer Review in LLMs Based on Consistency Optimization}},
  author    = {Ning, Kun-Peng and Yang, Shuo and Liu, Yuyang and Yao, Jia-Yu and Liu, Zhenhui and Tian, Yonghong and Song, Yibing and Yuan, Li},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/ning2025iclr-pico/}
}