Single-Pass Detection of Jailbreaking Input in Large Language Models
Abstract
Defending aligned Large Language Models (LLMs) against jailbreaking attacks is a challenging problem, with existing approaches requiring multiple requests or even queries to auxiliary LLMs, making them computationally heavy. Instead, we focus on detecting jailbreaking input in a single forward pass. Our method, called SPD, leverages the information carried by the logits to predict whether the output sentence will be harmful. This allows us to defend in just a forward pass. SPD can not only detect attacks effectively on open-source models, but also minimizes the misclassification of harmless inputs. Furthermore, we show that SPD remains effective even without complete logit access in GPT-3.5 and GPT-4. We believe that our proposed method offers a promising approach to efficiently safeguard LLMs against adversarial attacks.
Cite
Text
Candogan et al. "Single-Pass Detection of Jailbreaking Input in Large Language Models." Transactions on Machine Learning Research, 2025.Markdown
[Candogan et al. "Single-Pass Detection of Jailbreaking Input in Large Language Models." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/candogan2025tmlr-singlepass/)BibTeX
@article{candogan2025tmlr-singlepass,
title = {{Single-Pass Detection of Jailbreaking Input in Large Language Models}},
author = {Candogan, Leyla Naz and Wu, Yongtao and Rocamora, Elias Abad and Chrysos, Grigorios and Cevher, Volkan},
journal = {Transactions on Machine Learning Research},
year = {2025},
url = {https://mlanthology.org/tmlr/2025/candogan2025tmlr-singlepass/}
}