Algorithmic Oversight for Deceptive Reasoning
Abstract
This paper investigates the oversight problem in which a large language model (LLM) produces output that may contain deliberate adversarial errors and an oversight LLM/agent aims to detect them. We study this question in the context of mathematical reasoning. Our study proceeds in two primary steps: First, we develop attack strategies aimed at inducing deliberate reasoning errors that can deceive the oversight agent. Here, we find that even strong models can be deceived, which highlights the need for defense mechanisms. Second, we propose a set of defense mechanisms that protect against these attacks by augmenting oversight capabilities. Through these, we find that structured prompting, fine-tuning, and grey-box access can noticeably improve detection accuracy. Specifically, we introduce ProbShift, a novel detection algorithm that utilizes the token probabilities of the generated text. We find that ProbShift can outperform GPT-3.5 and can be further boosted with LLM-based oversight. Overall, this work demonstrates the feasibility and importance of developing algorithmic oversight mechanisms for LLMs, with emphasis on complex tasks requiring logical/mathematical reasoning.
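To make the idea of token-probability-based oversight concrete, below is a minimal, generic sketch and not the authors' ProbShift implementation, whose details are not given here. It assumes that suspicious reasoning steps tend to receive unusually low token probabilities under a reference model; the model choice (`gpt2`), the z-score threshold, the step-splitting convention, and the helper names `step_logprob` and `flag_suspicious_steps` are all illustrative assumptions.

```python
# Hypothetical sketch of probability-based oversight (NOT the paper's ProbShift):
# score each reasoning step by its mean token log-probability under a reference
# model and flag steps that score far below the others.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # assumption: any causal LM with accessible logits works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def step_logprob(context: str, step: str) -> float:
    """Mean log-probability of the tokens in `step`, given the preceding context."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    step_ids = tokenizer(step, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, step_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # log-probability of each token given everything before it
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = input_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    n_step = step_ids.shape[1]
    return token_lp[0, -n_step:].mean().item()

def flag_suspicious_steps(problem: str, steps: list[str], z_thresh: float = -2.0) -> list[int]:
    """Return indices of steps whose score is far below the mean over all steps."""
    if not steps:
        return []
    scores, context = [], problem
    for step in steps:
        scores.append(step_logprob(context, step))
        context += step
    mean = sum(scores) / len(scores)
    std = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5 or 1.0
    return [i for i, s in enumerate(scores) if (s - mean) / std < z_thresh]
```

In such a pipeline, the flagged step indices could then be handed to an LLM overseer for closer inspection, mirroring the abstract's observation that probability-based detection can be boosted with LLM-based oversight.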
Cite
Text
Taga et al. "Algorithmic Oversight for Deceptive Reasoning." NeurIPS 2024 Workshops: Red_Teaming_GenAI, 2024.
BibTeX
@inproceedings{taga2024neuripsw-algorithmic,
title = {{Algorithmic Oversight for Deceptive Reasoning}},
author = {Taga, Ege Onur and Li, Mingchen and Chen, Yongqi and Oymak, Samet},
booktitle = {NeurIPS 2024 Workshops: Red_Teaming_GenAI},
year = {2024},
url = {https://mlanthology.org/neuripsw/2024/taga2024neuripsw-algorithmic/}
}