Fast Proxies for LLM Robustness Evaluation
Abstract
Evaluating the robustness of LLMs to adversarial attacks is crucial for safe deployment, yet current red-teaming methods are often prohibitively expensive. We compare the ability of fast proxy metrics to predict the real-world robustness of an LLM against a simulated attacker ensemble. This allows us to estimate a model's robustness to computationally expensive attacks without running the attacks themselves. Specifically, we consider gradient-descent-based embedding-space attacks, prefilling attacks, and direct attacks. Although direct attacks in particular do not achieve a high attack success rate (ASR), we find that both they and embedding-space attacks predict ASR well, achieving $r_p=0.86$ (linear) and $r_s=0.97$ (Spearman rank) correlations with the full attack ensemble while reducing computational cost by three orders of magnitude.
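To make the two reported statistics concrete, the following is a minimal sketch of how a linear correlation and a Spearman rank correlation between a cheap proxy and an expensive attack ensemble could be computed. It reads $r_p$ as the Pearson coefficient, which is an assumption here. The per-model ASR values and the names `proxy_asr` and `ensemble_asr` are hypothetical and chosen only for illustration; they are not results from the paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson (linear) correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical ASR per model (one entry per evaluated model), not paper data.
proxy_asr = [0.02, 0.05, 0.04, 0.10, 0.20]     # cheap proxy attack
ensemble_asr = [0.30, 0.45, 0.50, 0.70, 0.95]  # expensive attack ensemble

r_p = pearson(proxy_asr, ensemble_asr)
r_s = spearman(proxy_asr, ensemble_asr)
print(f"r_p = {r_p:.2f}, r_s = {r_s:.2f}")
```

In this toy data the proxy's ASR is far lower than the ensemble's, yet it orders the models almost identically. That is the sense in which a weak attack can still serve as a useful predictor: the rank correlation depends only on the ordering of models, not on the absolute ASR.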
Cite
Text
Beyer et al. "Fast Proxies for LLM Robustness Evaluation." ICLR 2025 Workshops: BuildingTrust, 2025.
Markdown
[Beyer et al. "Fast Proxies for LLM Robustness Evaluation." ICLR 2025 Workshops: BuildingTrust, 2025.](https://mlanthology.org/iclrw/2025/beyer2025iclrw-fast/)
BibTeX
@inproceedings{beyer2025iclrw-fast,
title = {{Fast Proxies for LLM Robustness Evaluation}},
author = {Beyer, Tim and Schuchardt, Jan and Schwinn, Leo and Günnemann, Stephan},
booktitle = {ICLR 2025 Workshops: BuildingTrust},
year = {2025},
url = {https://mlanthology.org/iclrw/2025/beyer2025iclrw-fast/}
}