Modeling Human Beliefs About AI Behavior for Scalable Oversight

Abstract

As AI systems advance beyond human capabilities, scalable oversight becomes critical: how can we supervise AI that exceeds our abilities? A key challenge is that human evaluators may form incorrect beliefs about AI behavior in complex tasks, leading to unreliable feedback and poor value inference. To address this, we propose modeling evaluators' beliefs to interpret their feedback more reliably. We formalize human belief models, analyze their theoretical role in value learning, and characterize when ambiguity remains. To reduce reliance on precise belief models, we introduce "belief model covering" as a relaxation. This motivates our preliminary proposal to use the internal representations of adapted foundation models to mimic human evaluators' beliefs. These representations could be used to learn correct values from human feedback even when evaluators misunderstand the AI's behavior. Our work suggests that modeling human beliefs can improve value learning and outlines practical research directions for implementing this approach to scalable oversight.
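To make the core idea concrete, below is a minimal, hypothetical sketch (not the paper's implementation) of reward learning from pairwise human feedback when the evaluator judges trajectories through a belief model. All names (`belief_model` as the matrix `B`, the feature vectors, the Boltzmann-rational choice model) are illustrative assumptions introduced here for the example: the simulated human compares trajectories based on the features they *believe* occurred, and a learner that interprets the feedback through that belief model recovers the true reward weights more faithfully than one that takes the feedback at face value.

```python
# Hypothetical sketch: reward learning from pairwise preferences where the
# human evaluator perceives trajectories through a belief model B.
# Assumptions (not from the paper): linear reward, linear belief model,
# Boltzmann-rational (logistic) preference feedback.
import torch

torch.manual_seed(0)

n_features, n_pairs = 4, 500

# Ground-truth reward weights the learner should recover.
true_w = torch.tensor([1.0, -0.5, 0.3, 0.8])

# Belief model: how the evaluator (mis)perceives trajectory features,
# e.g. systematically under-weighting feature 2.
B = torch.eye(n_features)
B[2, 2] = 0.2

# Random trajectory feature vectors for each comparison pair (phi_a, phi_b).
phi_a = torch.randn(n_pairs, n_features)
phi_b = torch.randn(n_pairs, n_features)

# The human compares *believed* returns, not true returns.
believed_ret_a = (phi_a @ B.T) @ true_w
believed_ret_b = (phi_b @ B.T) @ true_w
prefs = torch.bernoulli(torch.sigmoid(believed_ret_a - believed_ret_b))

def fit(use_belief_model: bool) -> torch.Tensor:
    """Fit reward weights by maximum likelihood on the preference data."""
    w = torch.zeros(n_features, requires_grad=True)
    opt = torch.optim.Adam([w], lr=0.05)
    for _ in range(2000):
        fa, fb = phi_a, phi_b
        if use_belief_model:
            # Interpret feedback through the belief model: score trajectories
            # as the human would perceive them.
            fa, fb = fa @ B.T, fb @ B.T
        logits = fa @ w - fb @ w
        loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, prefs)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()

print("ignoring beliefs :", fit(False))
print("with belief model:", fit(True), "(closer to", true_w.tolist(), ")")
```

In this toy setup, ignoring the belief model recovers roughly `B.T @ true_w` (the reward as filtered through the evaluator's misperception), while conditioning on `B` recovers weights close to `true_w`. The paper's proposal goes further, suggesting that such belief models could be approximated by the internal representations of adapted foundation models rather than specified by hand.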

Cite

Text

Lang and Forré. "Modeling Human Beliefs About AI Behavior for Scalable Oversight." Transactions on Machine Learning Research, 2025.

Markdown

[Lang and Forré. "Modeling Human Beliefs About AI Behavior for Scalable Oversight." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/lang2025tmlr-modeling/)

BibTeX

@article{lang2025tmlr-modeling,
  title     = {{Modeling Human Beliefs About AI Behavior for Scalable Oversight}},
  author    = {Lang, Leon and Forré, Patrick},
  journal   = {Transactions on Machine Learning Research},
  year      = {2025},
  url       = {https://mlanthology.org/tmlr/2025/lang2025tmlr-modeling/}
}