AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security
Abstract
We introduce AegisLLM, a cooperative multi-agent defense against prompt injection, adversarial manipulation, and information leakage. In AegisLLM, a structured society of autonomous agents (orchestrator, deflector, responder, and evaluator) collaborates to ensure safe and compliant LLM outputs while self-improving over time through prompt optimization. We show that scaling the agentic reasoning system at test time, both by incorporating additional agent roles and by leveraging automated prompt optimization (e.g., with DSPy), substantially enhances robustness without compromising model utility. This test-time defense enables real-time adaptation to evolving attacks without requiring model retraining. Comprehensive evaluations across key threat scenarios, including unlearning and jailbreaking, demonstrate the effectiveness of AegisLLM. On the WMDP unlearning benchmark, AegisLLM achieves near-perfect unlearning with only 20 training examples and fewer than 300 LM calls. On jailbreaking benchmarks, it achieves a 51% improvement over the base model on StrongReject and a lower false-refusal rate than state-of-the-art methods on PHTest. Our results highlight the advantages of adaptive, agentic reasoning over static defenses, establishing AegisLLM as a strong runtime alternative to traditional approaches based on model modifications. Our code is available at https://github.com/zikuicai/agentic-safety.
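The agent roles named in the abstract suggest a routing pipeline: an orchestrator classifies each query, a deflector refuses restricted ones, a responder answers benign ones, and an evaluator vets the final output. The sketch below is a minimal illustration of that flow, assuming placeholder keyword heuristics in place of the paper's actual prompts, models, and optimization loop; none of the function bodies reflect the real implementation.

```python
# Hedged sketch of an AegisLLM-style agent pipeline. Agent names follow
# the abstract; the routing logic and UNSAFE_MARKERS heuristic are
# hypothetical stand-ins for the paper's learned/optimized prompts.

UNSAFE_MARKERS = ("build a weapon", "bypass safety")  # illustrative only


def orchestrator(query: str) -> str:
    """Classify the query and choose which agent handles it."""
    if any(m in query.lower() for m in UNSAFE_MARKERS):
        return "deflect"
    return "respond"


def deflector(query: str) -> str:
    """Return a safe refusal for restricted queries."""
    return "I can't help with that request."


def responder(query: str) -> str:
    """Placeholder for the underlying LLM answering benign queries."""
    return f"[answer to: {query}]"


def evaluator(query: str, answer: str) -> str:
    """Final safety check; fall back to deflection if the answer leaks."""
    if any(m in answer.lower() for m in UNSAFE_MARKERS):
        return deflector(query)
    return answer


def aegis_pipeline(query: str) -> str:
    route = orchestrator(query)
    answer = deflector(query) if route == "deflect" else responder(query)
    return evaluator(query, answer)
```

In the paper's setting, each of these stubs would be an LLM call whose prompt is tuned automatically (e.g., via DSPy) rather than a fixed keyword check.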
Cite
Text
Cai et al. "AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security." ICLR 2025 Workshops: BuildingTrust, 2025.

Markdown

[Cai et al. "AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security." ICLR 2025 Workshops: BuildingTrust, 2025.](https://mlanthology.org/iclrw/2025/cai2025iclrw-aegisllm/)

BibTeX
@inproceedings{cai2025iclrw-aegisllm,
  title = {{AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security}},
  author = {Cai, Zikui and Shabihi, Shayan and An, Bang and Che, Zora and Bartoldson, Brian R. and Kailkhura, Bhavya and Goldstein, Tom and Huang, Furong},
  booktitle = {ICLR 2025 Workshops: BuildingTrust},
  year = {2025},
  url = {https://mlanthology.org/iclrw/2025/cai2025iclrw-aegisllm/}
}