SEAS: Self-Evolving Adversarial Safety Optimization for Large Language Models

Diao, Muxi; Li, Rumei; Liu, Shiyang; Liao, Guogang; Wang, Jingang; Cai, Xunliang; Xu, Weiran

doi:10.1609/AAAI.V39I22.34549

AAAI 2025, pp. 23778-23786

Abstract

As Large Language Models (LLMs) continue to advance in capability and influence, ensuring their security and preventing harmful outputs has become crucial. A promising approach to address these concerns involves training models to automatically generate adversarial prompts for red teaming. However, the evolving subtlety of vulnerabilities in LLMs challenges the effectiveness of current adversarial methods, which struggle to generate diverse, complex prompts and dynamically explore the weaknesses of these models. To tackle these challenges, we introduce the Self-Evolving Adversarial Safety (SEAS) optimization framework, which includes both a SEAS dataset and a SEAS pipeline. The SEAS dataset comprises complex adversarial prompts, while the SEAS pipeline operates through three stages: Initialization, Attack, and Adversarial Optimization. This framework generates a diverse range of adversarial prompts and dynamically explores the model's vulnerabilities to enhance its security. Our contributions include a novel adversarial framework, a comprehensive safety dataset, and empirical evidence demonstrating the effectiveness of SEAS.

Cite

Text

Diao et al. "SEAS: Self-Evolving Adversarial Safety Optimization for Large Language Models." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I22.34549

Markdown

[Diao et al. "SEAS: Self-Evolving Adversarial Safety Optimization for Large Language Models." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/diao2025aaai-seas/) doi:10.1609/AAAI.V39I22.34549

BibTeX

@inproceedings{diao2025aaai-seas,
  title     = {{SEAS: Self-Evolving Adversarial Safety Optimization for Large Language Models}},
  author    = {Diao, Muxi and Li, Rumei and Liu, Shiyang and Liao, Guogang and Wang, Jingang and Cai, Xunliang and Xu, Weiran},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {23778--23786},
  doi       = {10.1609/AAAI.V39I22.34549},
  url       = {https://mlanthology.org/aaai/2025/diao2025aaai-seas/}
}