A Gold Standard Dataset for the Reviewer Assignment Problem

Abstract

Many peer-review venues use algorithms to assign submissions to reviewers. The crux of such automated approaches is the notion of the "similarity score", a numerical estimate of a reviewer's expertise in reviewing a paper, and many algorithms have been proposed to compute these scores. However, these algorithms have not been subjected to a principled comparison, making it difficult for stakeholders to choose an algorithm in an evidence-based manner. The key challenge in comparing existing algorithms and developing better ones is the lack of publicly available gold-standard data needed for reproducible research. We address this challenge by collecting a novel dataset of similarity scores that we release to the research community. Our dataset consists of 477 self-reported expertise scores provided by 58 researchers who evaluated their expertise in reviewing papers they had read previously. We use this data to compare several popular algorithms currently employed in computer science conferences and offer recommendations for stakeholders. Our four main findings are:

- All algorithms make a non-trivial amount of error. For the task of ordering two papers by their relevance to a reviewer, error rates range from 12%-30% in easy cases to 36%-43% in hard cases, highlighting the vital need for more research on the similarity-computation problem.
- Most specialized algorithms are designed to work with titles and abstracts of papers, and in this regime the Specter2 algorithm performs best.
- The classical TF-IDF algorithm, which can use the full texts of papers, is on par with Specter2, which uses only titles and abstracts.
- Off-the-shelf LLMs perform worse than the specialized algorithms.

We encourage researchers to participate in our survey and contribute their data to the dataset here: https://forms.gle/SP1Rh8eivGz54xR37
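To make the similarity-score and pairwise-ordering setup concrete, below is a minimal sketch, not the paper's actual pipeline. It uses scikit-learn to compute a TF-IDF-based similarity between a reviewer's past papers and candidate submissions, then measures the pairwise-ordering error rate against self-reported expertise scores. The toy data (reviewer_papers, submissions, gold_scores) and the max-similarity aggregation are illustrative assumptions, not drawn from our dataset.

    # Sketch: TF-IDF similarity scores + pairwise-ordering error rate.
    # Assumes scikit-learn is installed; all data below is hypothetical.
    from itertools import combinations

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Toy reviewer profile and submissions (placeholders, not real data).
    reviewer_papers = [
        "graph neural networks for molecule property prediction",
        "message passing architectures and their expressive power",
    ]
    submissions = [
        "a new benchmark for molecular property prediction",
        "low-rank adaptation of large language models",
        "on the expressivity of graph transformers",
    ]
    # Self-reported expertise scores for the submissions (higher = more expert);
    # these play the role of the gold standard.
    gold_scores = [5, 1, 4]

    # Fit TF-IDF on all documents so reviewer and submission vectors
    # share one vocabulary.
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(reviewer_papers + submissions)
    reviewer_vecs = X[: len(reviewer_papers)]
    submission_vecs = X[len(reviewer_papers):]

    # Score each submission by its maximum cosine similarity to any of the
    # reviewer's papers (one common aggregation choice among several).
    sims = cosine_similarity(submission_vecs, reviewer_vecs).max(axis=1)

    # Pairwise-ordering error: over all submission pairs with distinct gold
    # scores, how often does the algorithm order the pair incorrectly?
    pairs = [
        (i, j)
        for i, j in combinations(range(len(submissions)), 2)
        if gold_scores[i] != gold_scores[j]
    ]
    errors = sum(
        (gold_scores[i] - gold_scores[j]) * (sims[i] - sims[j]) < 0
        for i, j in pairs
    )
    print(f"pairwise error rate: {errors / len(pairs):.2f}")

The same pairwise-error computation applies unchanged to any scoring algorithm (e.g., Specter2 embeddings in place of TF-IDF vectors), which is what makes it a convenient common yardstick for comparison.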

Cite

Text

Stelmakh et al. "A Gold Standard Dataset for the Reviewer Assignment Problem." Transactions on Machine Learning Research, 2025.

Markdown

[Stelmakh et al. "A Gold Standard Dataset for the Reviewer Assignment Problem." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/stelmakh2025tmlr-gold/)

BibTeX

@article{stelmakh2025tmlr-gold,
  title     = {{A Gold Standard Dataset for the Reviewer Assignment Problem}},
  author    = {Stelmakh, Ivan and Wieting, John Frederick and Xi, Yang and Neubig, Graham and Shah, Nihar B},
  journal   = {Transactions on Machine Learning Research},
  year      = {2025},
  url       = {https://mlanthology.org/tmlr/2025/stelmakh2025tmlr-gold/}
}