STAR: Strategy-Driven Automatic Jailbreak Red-Teaming for Large Language Model

Abstract

Jailbreaking refers to techniques that bypass the safety alignment of large language models (LLMs) to elicit harmful outputs, and automated red-teaming has become a key approach for detecting such vulnerabilities before deployment. However, most existing red-teaming methods operate directly in text space, where they tend to generate semantically similar prompts and thus fail to probe the broader spectrum of latent vulnerabilities within a model. To address this limitation, we shift the exploration of jailbreak strategies from conventional text space to the model’s latent activation space and propose STAR (**ST**rategy-driven **A**utomatic Jailbreak **R**ed-teaming), a black-box framework for systematically generating jailbreak prompts. STAR is composed of two modules: (i) a strategy generation module, which extracts the principal components of existing strategies and recombines them to generate novel ones; and (ii) a prompt generation module, which translates abstract strategies into concrete jailbreak prompts with high success rates. Experimental results show that STAR substantially outperforms state-of-the-art baselines in both attack success rate and strategy diversity. These findings highlight critical vulnerabilities in current alignment techniques and establish STAR as a more powerful paradigm for comprehensive LLM security evaluation.

Cite

Text

Liu et al. "STAR: Strategy-Driven Automatic Jailbreak Red-Teaming for Large Language Model." International Conference on Learning Representations, 2026.

Markdown

[Liu et al. "STAR: Strategy-Driven Automatic Jailbreak Red-Teaming for Large Language Model." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/liu2026iclr-star/)

BibTeX

@inproceedings{liu2026iclr-star,
  title     = {{STAR: Strategy-Driven Automatic Jailbreak Red-Teaming for Large Language Model}},
  author    = {Liu, Jianing and Li, Qingming and Chen, Jiahao and Zeng, Rui and Zhao, Binbin and Ji, Shouling},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/liu2026iclr-star/}
}