Towards Shutdownable Agents via Stochastic Choice

Thornley, Elliott; Roman, Alexander; Ziakas, Christos; Thomson, Louis; Ho, Leyton

TMLR 2025

Abstract

The POST-Agents Proposal (PAP) is an idea for ensuring that advanced artificial agents never resist shutdown. A key part of the PAP is using a novel 'Discounted Reward for Same-Length Trajectories (DReST)' reward function to train agents to (1) pursue goals effectively conditional on each trajectory-length (be 'USEFUL'), and (2) choose stochastically between different trajectory-lengths (be 'NEUTRAL' about trajectory-lengths). In this paper, we propose evaluation metrics for USEFULNESS and NEUTRALITY. We use a DReST reward function to train simple agents to navigate gridworlds, and we find that these agents learn to be USEFUL and NEUTRAL. Our results thus provide some initial evidence that DReST reward functions could train advanced agents to be USEFUL and NEUTRAL. Our theoretical work suggests that these agents would be useful and shutdownable.

Cite

Text

Thornley et al. "Towards Shutdownable Agents via Stochastic Choice." Transactions on Machine Learning Research, 2025.

Markdown

[Thornley et al. "Towards Shutdownable Agents via Stochastic Choice." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/thornley2025tmlr-shutdownable/)

BibTeX

@article{thornley2025tmlr-shutdownable,
  title     = {{Towards Shutdownable Agents via Stochastic Choice}},
  author    = {Thornley, Elliott and Roman, Alexander and Ziakas, Christos and Thomson, Louis and Ho, Leyton},
  journal   = {Transactions on Machine Learning Research},
  year      = {2025},
  url       = {https://mlanthology.org/tmlr/2025/thornley2025tmlr-shutdownable/}
}