Exploring Scaling Trends in LLM Robustness

Abstract

Language model capabilities improve predictably as model size and training data are scaled up. Motivated by this, increasingly large language models have been trained, yielding an array of impressive capabilities. Yet these models are vulnerable to adversarial prompts, such as "jailbreaks" that hijack models into performing undesired behaviors, posing a significant risk of misuse. Prior work has found that computer vision models become more robust with model and data scaling, raising the question: does language model robustness also improve with scale? We study this question empirically and find that larger models respond substantially more effectively to adversarial training, but that there is little to no benefit from model scale in the absence of defenses.
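The adversarial training the abstract refers to means training a model on inputs an attack has perturbed to maximize its loss. As a point of reference only, below is a minimal toy sketch of such a loop in PyTorch. It is not the paper's implementation: the tiny classifier, the random token-swap attacker, and all hyperparameters are illustrative assumptions.

```python
# Toy adversarial-training loop (illustrative sketch, NOT the paper's setup).
import torch
import torch.nn as nn

VOCAB, DIM, CLASSES = 1000, 64, 2

class TinyLM(nn.Module):
    """Minimal embedding + mean-pool classifier standing in for a language model."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, DIM)
        self.head = nn.Linear(DIM, CLASSES)

    def forward(self, tokens):  # tokens: (batch, seq) of token ids
        return self.head(self.emb(tokens).mean(dim=1))

def random_swap_attack(model, tokens, labels, n_tries=8):
    """Hypothetical attacker: try random single-token swaps and keep the
    candidate that most increases the model's loss."""
    loss_fn = nn.CrossEntropyLoss()
    with torch.no_grad():
        best, best_loss = tokens, loss_fn(model(tokens), labels).item()
        for _ in range(n_tries):
            cand = tokens.clone()
            batch = torch.arange(tokens.shape[0])
            pos = torch.randint(0, tokens.shape[1], (tokens.shape[0],))
            cand[batch, pos] = torch.randint(0, VOCAB, (tokens.shape[0],))
            loss = loss_fn(model(cand), labels).item()
            if loss > best_loss:
                best, best_loss = cand, loss
    return best

model = TinyLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for step in range(100):
    tokens = torch.randint(0, VOCAB, (16, 32))       # synthetic batch
    labels = torch.randint(0, CLASSES, (16,))
    adv = random_swap_attack(model, tokens, labels)  # attack the current model
    opt.zero_grad()
    loss_fn(model(adv), labels).backward()           # train on adversarial inputs
    opt.step()
```

In terms of this sketch, the paper's headline finding would correspond to the adversarial-training loop yielding larger robustness gains as the model is scaled up, while scaling the model without the attack-and-train step yields little benefit.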

Cite

Text

Howe et al. "Exploring Scaling Trends in LLM Robustness." ICML 2024 Workshops: NextGenAISafety, 2024.

Markdown

[Howe et al. "Exploring Scaling Trends in LLM Robustness." ICML 2024 Workshops: NextGenAISafety, 2024.](https://mlanthology.org/icmlw/2024/howe2024icmlw-exploring/)

BibTeX

@inproceedings{howe2024icmlw-exploring,
  title     = {{Exploring Scaling Trends in LLM Robustness}},
  author    = {Howe, Nikolaus H. R. and Zając, Michał and McKenzie, Ian R. and Hollinsworth, Oskar John and Bacon, Pierre-Luc and Gleave, Adam},
  booktitle = {ICML 2024 Workshops: NextGenAISafety},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/howe2024icmlw-exploring/}
}