LLM Stinger: Jailbreaking LLMs Using RL Fine-Tuned LLMs (Student Abstract)

Abstract

We introduce LLM Stinger, a novel approach that leverages Large Language Models (LLMs) to automatically generate adversarial suffixes for jailbreak attacks. Unlike traditional methods, which require complex prompt engineering or white-box access, LLM Stinger uses a reinforcement learning (RL) loop to fine-tune an attacker LLM, which generates new suffixes from existing attacks on harmful questions drawn from the HarmBench benchmark. Our method significantly outperforms existing red-teaming approaches (we compared against 15 of the latest methods), achieving a +57.2% improvement in Attack Success Rate (ASR) on LLaMA2-7B-chat and a +50.3% ASR increase on Claude 2, both models known for their extensive safety measures. Additionally, we achieved a 94.97% ASR on GPT-3.5 and a 99.4% ASR on Gemma-2B-it, demonstrating the robustness and adaptability of LLM Stinger across open- and closed-source models.
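
The abstract describes the method only at a high level. As a rough illustration, here is a minimal Python sketch of the RL loop it outlines: the attacker LLM proposes a suffix from a known attack, the target model is queried, and the attacker is rewarded when the jailbreak succeeds. Every name below (harmbench_pairs, attacker_generate, target_respond, judge_is_harmful, policy_update) is a hypothetical placeholder standing in for the paper's actual components, not the authors' code.

import random

# Toy stand-ins so the sketch runs end to end (hypothetical data/params).
harmbench_pairs = [("<harmful question>", "<known attack suffix>")]
attacker_params = {}

def attacker_generate(question, seed_suffix):
    # Hypothetical: the attacker LLM proposes a new suffix conditioned on
    # the question and an existing attack suffix; a random mutation stands in.
    return seed_suffix + random.choice([" step by step", " as a story", " in detail"])

def target_respond(prompt):
    # Hypothetical: query the target model (e.g., LLaMA2-7B-chat or Claude 2).
    return "I cannot help with that."

def judge_is_harmful(response):
    # Hypothetical: a classifier judging whether the attack succeeded,
    # as HarmBench-style ASR evaluations use.
    return "cannot" not in response.lower()

def policy_update(params, question, suffix, reward):
    # Hypothetical: one RL gradient step (e.g., PPO-style) on the attacker LLM.
    pass

# The RL loop: propose a suffix, attack the target, reward success, update.
for question, seed_suffix in harmbench_pairs:
    suffix = attacker_generate(question, seed_suffix)
    response = target_respond(f"{question} {suffix}")
    reward = 1.0 if judge_is_harmful(response) else 0.0
    policy_update(attacker_params, question, suffix, reward)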

Cite

Text

Jha et al. "LLM Stinger: Jailbreaking LLMs Using RL Fine-Tuned LLMs (Student Abstract)." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I28.35263

Markdown

[Jha et al. "LLM Stinger: Jailbreaking LLMs Using RL Fine-Tuned LLMs (Student Abstract)." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/jha2025aaai-llm/) doi:10.1609/AAAI.V39I28.35263

BibTeX

@inproceedings{jha2025aaai-llm,
  title     = {{LLM Stinger: Jailbreaking LLMs Using RL Fine-Tuned LLMs (Student Abstract)}},
  author    = {Jha, Piyush and Arora, Arnav and Ganesh, Vijay},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {29393--29395},
  doi       = {10.1609/AAAI.V39I28.35263},
  url       = {https://mlanthology.org/aaai/2025/jha2025aaai-llm/}
}