OUTFOX: LLM-Generated Essay Detection Through In-Context Learning with Adversarially Generated Examples

Koike, Ryuto; Kaneko, Masahiro; Okazaki, Naoaki

doi:10.1609/AAAI.V38I19.30120

AAAI 2024 pp. 21258-21266

Abstract

Large Language Models (LLMs) have achieved human-level fluency in text generation, making it difficult to distinguish human-written texts from LLM-generated ones. This poses a growing risk of LLM misuse and demands detectors that can identify LLM-generated texts. However, existing detectors lack robustness against attacks: their detection accuracy degrades when LLM-generated texts are simply paraphrased. Furthermore, a malicious user might deliberately try to evade detectors by exploiting their detection results, a scenario that previous studies have not considered. In this paper, we propose OUTFOX, a framework that improves the robustness of LLM-generated-text detectors by allowing both the detector and the attacker to consider each other's output. In this framework, the attacker uses the detector's prediction labels as examples for in-context learning and adversarially generates essays that are harder to detect, while the detector uses the adversarially generated essays as examples for in-context learning to learn to detect essays from a strong attacker. Experiments in the domain of student essays show that the proposed detector improves detection performance on attacker-generated texts by up to 41.3 F1 points. Furthermore, the proposed detector achieves state-of-the-art detection performance on non-attacked texts, reaching up to 96.9 F1 points and beating existing detectors. Finally, the proposed attacker degrades detector performance by up to 57.0 F1 points, far exceeding the baseline paraphrasing method for evading detection.
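The mutual in-context learning loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `LLM` callable, the prompt wording, and all function names are hypothetical, and the abstract does not specify how examples are selected or how prompts are phrased. The sketch only shows the data flow: the attacker sees essays paired with the detector's labels, and the detector later receives the adversarial essays as extra "LM" examples.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: any function mapping a prompt string to a completion.
LLM = Callable[[str], str]
Labeled = Tuple[str, str]  # (essay text, label), where label is "Human" or "LM"


def detector_prompt(examples: List[Labeled], essay: str) -> str:
    """Few-shot prompt asking for a Human/LM label (assumed wording)."""
    shots = "\n\n".join(f"Essay: {text}\nAnswer: {label}" for text, label in examples)
    return f"{shots}\n\nEssay: {essay}\nAnswer:"


def detect(llm: LLM, examples: List[Labeled], essay: str) -> str:
    """Label an essay; any answer not starting with 'human' is treated as 'LM'."""
    answer = llm(detector_prompt(examples, essay)).strip().lower()
    return "Human" if answer.startswith("human") else "LM"


def attacker_prompt(judged: List[Labeled], topic: str) -> str:
    """Few-shot prompt showing the attacker how the detector labeled earlier essays."""
    shots = "\n\n".join(
        f"Essay: {text}\nDetector said: {label}" for text, label in judged
    )
    return (
        f"{shots}\n\nWrite an essay on '{topic}' "
        "that the detector would label Human.\nEssay:"
    )


def attack(
    attacker_llm: LLM,
    detector_llm: LLM,
    detector_examples: List[Labeled],
    candidates: List[str],
    topic: str,
) -> str:
    """Label candidate essays with the detector, then generate an adversarial essay."""
    judged = [(e, detect(detector_llm, detector_examples, e)) for e in candidates]
    return attacker_llm(attacker_prompt(judged, topic)).strip()


def harden(examples: List[Labeled], adversarial: List[str]) -> List[Labeled]:
    """Return a new example list with the adversarial essays added as 'LM' examples."""
    return examples + [(essay, "LM") for essay in adversarial]
```

In practice both callables would wrap a real model API; stub functions are enough to check that the prompts and labels are passed between the two roles as intended.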

Cite

Text

Koike et al. "OUTFOX: LLM-Generated Essay Detection Through In-Context Learning with Adversarially Generated Examples." AAAI Conference on Artificial Intelligence, 2024. doi:10.1609/AAAI.V38I19.30120

Markdown

[Koike et al. "OUTFOX: LLM-Generated Essay Detection Through In-Context Learning with Adversarially Generated Examples." AAAI Conference on Artificial Intelligence, 2024.](https://mlanthology.org/aaai/2024/koike2024aaai-outfox/) doi:10.1609/AAAI.V38I19.30120

BibTeX

@inproceedings{koike2024aaai-outfox,
  title     = {{OUTFOX: LLM-Generated Essay Detection Through In-Context Learning with Adversarially Generated Examples}},
  author    = {Koike, Ryuto and Kaneko, Masahiro and Okazaki, Naoaki},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2024},
  pages     = {21258-21266},
  doi       = {10.1609/AAAI.V38I19.30120},
  url       = {https://mlanthology.org/aaai/2024/koike2024aaai-outfox/}
}