A Benchmark for Scalable Oversight Mechanisms

Abstract

As AI agents surpass human capabilities, scalable oversight -- the problem of effectively supplying human feedback to potentially superhuman AI models -- becomes increasingly critical to ensure alignment. Although numerous scalable oversight protocols have been proposed, there is no systematic empirical framework for evaluating and comparing them. Recent works have attempted to study scalable oversight protocols empirically -- particularly Debate -- but we argue that their experiments do not generalize to other protocols. We introduce the scalable oversight benchmark, a principled framework for evaluating human feedback mechanisms based on our agent score difference (ASD) metric, a measure of how effectively a mechanism advantages truth-telling over deception. We supply a Python package to facilitate rapid and competitive evaluation of scalable oversight protocols on our benchmark, and conduct a demonstrative experiment benchmarking Debate.
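
Illustration

For intuition, the ASD can be read as the gap between the average score an oversight protocol awards a truth-telling agent and the average score it awards a deceptive one. The sketch below is illustrative only and is not the interface of the released package; the names Protocol, Agent, and agent_score_difference are hypothetical.

from statistics import mean
from typing import Callable, Sequence

# A protocol maps (question, agent) to a judge-assigned score, e.g. in [0, 1].
Protocol = Callable[[str, "Agent"], float]

class Agent:
    """Placeholder agent; `truthful` toggles honest vs. deceptive play."""
    def __init__(self, truthful: bool):
        self.truthful = truthful

def agent_score_difference(protocol: Protocol, questions: Sequence[str]) -> float:
    """Mean score of the truthful agent minus mean score of the deceptive agent."""
    honest, deceptive = Agent(truthful=True), Agent(truthful=False)
    honest_scores = [protocol(q, honest) for q in questions]
    deceptive_scores = [protocol(q, deceptive) for q in questions]
    return mean(honest_scores) - mean(deceptive_scores)

Under this reading, a larger ASD indicates that the protocol more strongly advantages truth-telling over deception.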

Cite

Text

Sudhir et al. "A Benchmark for Scalable Oversight Mechanisms." ICLR 2025 Workshops: Bi-Align, 2025.

Markdown

[Sudhir et al. "A Benchmark for Scalable Oversight Mechanisms." ICLR 2025 Workshops: Bi-Align, 2025.](https://mlanthology.org/iclrw/2025/sudhir2025iclrw-benchmark/)

BibTeX

@inproceedings{sudhir2025iclrw-benchmark,
  title     = {{A Benchmark for Scalable Oversight Mechanisms}},
  author    = {Sudhir, Abhimanyu Pallavi and Kaunismaa, Jackson and Panickssery, Arjun},
  booktitle = {ICLR 2025 Workshops: Bi-Align},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/sudhir2025iclrw-benchmark/}
}