Algorithmic Oversight for Deceptive Reasoning
Abstract
This paper investigates the oversight problem in which a large language model (LLM) produces output that may contain deliberate adversarial errors and an oversight LLM/agent aims to detect them. We study this question in the context of mathematical reasoning. Our study proceeds in two primary steps: First, we develop attack strategies aimed at inducing deliberate reasoning errors that can deceive the oversight agent. Here, we find that even strong models can be deceived, which highlights the need for defense mechanisms. Second, we propose a set of defense mechanisms that protect against these attacks by augmenting oversight capabilities. Through these, we find that structured prompting, fine-tuning, and grey-box access can noticeably improve detection accuracy. Specifically, we introduce ProbShift, a novel detection algorithm that utilizes the token probabilities of the generated text. We find that ProbShift can outperform GPT-3.5 and can be further boosted with LLM-based oversight. Overall, this work demonstrates the feasibility and importance of developing algorithmic oversight mechanisms for LLMs, with emphasis on complex tasks requiring logical/mathematical reasoning.
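To make the idea of token-probability-based oversight concrete, below is a minimal, generic sketch and not the authors' ProbShift implementation, whose details are not given here. It assumes that suspicious reasoning steps tend to receive unusually low token probabilities under a reference model; the model choice (`gpt2`), the z-score threshold, the step-splitting convention, and the helper names `step_logprob` and `flag_suspicious_steps` are all illustrative assumptions.

```python
# Hypothetical sketch of probability-based oversight (NOT the paper's ProbShift):
# score each reasoning step by its mean token log-probability under a reference
# model and flag steps that score far below the others.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # assumption: any causal LM with accessible logits works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def step_logprob(context: str, step: str) -> float:
    """Mean log-probability of the tokens in `step`, given the preceding context."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    step_ids = tokenizer(step, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, step_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # log-probability of each token given everything before it
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = input_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    n_step = step_ids.shape[1]
    return token_lp[0, -n_step:].mean().item()

def flag_suspicious_steps(problem: str, steps: list[str], z_thresh: float = -2.0) -> list[int]:
    """Return indices of steps whose score is far below the mean over all steps."""
    if not steps:
        return []
    scores, context = [], problem
    for step in steps:
        scores.append(step_logprob(context, step))
        context += step
    mean = sum(scores) / len(scores)
    std = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5 or 1.0
    return [i for i, s in enumerate(scores) if (s - mean) / std < z_thresh]
```

In such a pipeline, the flagged step indices could then be handed to an LLM overseer for closer inspection, mirroring the abstract's observation that probability-based detection can be boosted with LLM-based oversight.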
Cite
Text
Taga et al. "Algorithmic Oversight for Deceptive Reasoning." NeurIPS 2024 Workshops: Red_Teaming_GenAI, 2024.
BibTeX
@inproceedings{taga2024neuripsw-algorithmic,
title = {{Algorithmic Oversight for Deceptive Reasoning}},
author = {Taga, Ege Onur and Li, Mingchen and Chen, Yongqi and Oymak, Samet},
booktitle = {NeurIPS 2024 Workshops: Red_Teaming_GenAI},
year = {2024},
url = {https://mlanthology.org/neuripsw/2024/taga2024neuripsw-algorithmic/}
}