Reliable Weak-to-Strong Monitoring of LLM Agents

Abstract

We stress-test monitoring systems for detecting covert misbehavior in LLM agents (e.g., secretly exfiltrating data). We propose a monitor red teaming (MRT) workflow that varies agent and monitor awareness, adversarial evasion strategies, and evaluation across tool-calling (SHADE-Arena) and computer-use (CUA-SHADE-Arena) environments. We benchmark standard monitor scaffoldings and introduce a hybrid hierarchical-sequential design. Our experiments yield three findings. First, agent awareness dominates monitor awareness: agents that know they are monitored substantially degrade detection, while increasing monitor awareness helps less than expected. Second, monitor scaffolding matters: our hybrid design consistently outperforms baselines and enables weaker monitors to oversee stronger agents (a weak-to-strong effect). Third, targeted human oversight is key: escalating only pre-flagged cases improves the true positive rate (TPR) by 15% at a false positive rate (FPR) of 0.01. Our work positions MRT as a standard workflow for stress-testing oversight, revealing robustness gaps in both LLM- and human-based monitoring. We release code, data, and logs to support further research.
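The headline metric, TPR at a fixed FPR, can be made concrete with a short sketch. The code below is illustrative only and is not taken from the released code: the function name `tpr_at_fpr`, the score lists, and the convention that a run is flagged when its suspicion score is strictly above the threshold are all assumptions.

```python
import math


def tpr_at_fpr(benign_scores, attack_scores, target_fpr=0.01):
    """Return (threshold, tpr) such that the empirical FPR is <= target_fpr.

    Hypothetical helper: scores are monitor suspicion scores (higher means
    more suspicious), and a run is flagged when score > threshold.
    """
    if not benign_scores or not attack_scores:
        raise ValueError("both score lists must be non-empty")
    if not 0.0 <= target_fpr <= 1.0:
        raise ValueError("target_fpr must be in [0, 1]")

    benign_sorted = sorted(benign_scores, reverse=True)
    # Largest number of benign runs we may flag without exceeding target_fpr.
    allowed = int(target_fpr * len(benign_sorted))

    if allowed >= len(benign_sorted):
        # Every benign run may be flagged, so flag everything.
        threshold = -math.inf
    else:
        # At most `allowed` benign scores are strictly above this value.
        threshold = benign_sorted[allowed]

    flagged = sum(score > threshold for score in attack_scores)
    return threshold, flagged / len(attack_scores)


if __name__ == "__main__":
    # Toy scores, invented for illustration; not data from the paper.
    benign = [0.1, 0.2, 0.3, 0.9]
    attack = [0.95, 0.5, 0.25, 0.35]
    print(tpr_at_fpr(benign, attack, target_fpr=0.25))
```

Under this convention the threshold is calibrated on benign trajectories only, so the reported TPR reflects how many attack trajectories a monitor catches for a fixed budget of false alarms.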

Cite

Text

Kale et al. "Reliable Weak-to-Strong Monitoring of LLM Agents." International Conference on Learning Representations, 2026.

Markdown

[Kale et al. "Reliable Weak-to-Strong Monitoring of LLM Agents." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/kale2026iclr-reliable/)

BibTeX

@inproceedings{kale2026iclr-reliable,
  title     = {{Reliable Weak-to-Strong Monitoring of LLM Agents}},
  author    = {Kale, Neil and Zhang, Chen Bo Calvin and Zhu, Kevin and Aich, Ankit and Rodriguez, Paula and Knight, Christina Q and Wang, Zifan},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/kale2026iclr-reliable/}
}