Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models

Abstract

Automated red-teaming has emerged as an essential approach for identifying vulnerabilities in large language models (LLMs). However, most existing methods rely on fixed attack templates and focus primarily on individual high-severity flaws, limiting their adaptability to evolving defenses and their ability to detect complex, high-exploitability vulnerabilities. To address these limitations, we propose AUTO-RT, a reinforcement learning framework designed for automatic jailbreak strategy exploration, i.e., discovering diverse and effective prompts capable of bypassing the safety restrictions of LLMs. AUTO-RT autonomously explores and optimizes attack strategies by interacting with the target model and generating crafted queries that trigger security failures. Specifically, AUTO-RT introduces two key techniques to improve exploration efficiency and attack effectiveness: 1) Dynamic Strategy Pruning, which focuses exploration on high-potential strategies by eliminating highly redundant paths early, and 2) Progressive Reward Tracking, which leverages intermediate downgrade models and a novel First Inverse Rate (FIR) metric to smooth sparse rewards and guide learning. Extensive experiments across diverse white-box and black-box LLM settings demonstrate that AUTO-RT significantly improves success rates (by up to 16.63%), expands vulnerability coverage, and accelerates discovery compared to existing methods.

Cite

Text

Liu et al. "Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models." International Conference on Learning Representations, 2026.

Markdown

[Liu et al. "Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/liu2026iclr-autort/)

BibTeX

@inproceedings{liu2026iclr-autort,
  title     = {{Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models}},
  author    = {Liu, Yanjiang and Zhou, Shuheng and Lu, Yaojie and Zhu, Huijia and Wang, Weiqiang and Lin, Hongyu and He, Ben and Han, Xianpei and Sun, Le},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/liu2026iclr-autort/}
}