Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?

Richard Ren, Steven Basart, Adam Khoja, Alexander Pan, Alice Gatti, Long Phan, Xuwang Yin, Mantas Mazeika, Gabriel Mukobi, Ryan Hwang Kim, Stephen Fitz, Dan Hendrycks

NeurIPS 2024

doi:10.52202/079017-2190

Abstract

Performance on popular ML benchmarks is highly correlated with model scale, suggesting that most benchmarks tend to measure a similar underlying factor of general model capabilities. However, substantial research effort remains devoted to designing new benchmarks, many of which claim to measure novel phenomena. In the spirit of the Bitter Lesson, we leverage spectral analysis to measure an underlying capabilities component, the direction in benchmark-performance-space which explains most variation in model performance. In an extensive analysis of existing safety benchmarks, we find that variance in model performance on many safety benchmarks is largely explained by the capabilities component. In response, we argue that safety research should prioritize metrics which are not highly correlated with scale. Our work provides a lens to analyze both novel safety benchmarks and novel safety methods, which we hope will enable future work to make differential progress on safety.
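The analysis the abstract describes can be illustrated with a small sketch. It assumes that the "capabilities component" is taken as the first principal component of a standardized models-by-benchmarks score matrix, and that a safety benchmark's entanglement with capabilities is measured by Spearman rank correlation; the abstract does not spell out either choice. All scores below are synthetic, and the function and variable names are hypothetical.

```python
import numpy as np

def capabilities_component(scores):
    """Project models onto the first principal component of benchmark scores.

    scores: (n_models, n_benchmarks) array of capability benchmark results.
    Returns one capabilities score per model and the fraction of variance
    explained by that component.
    """
    # Standardize each benchmark so no single scale dominates.
    z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
    # The first right singular vector is the direction of maximal variance.
    _, s, vt = np.linalg.svd(z, full_matrices=False)
    direction = vt[0]
    # The sign of a principal component is arbitrary; orient it so that
    # higher benchmark scores give a higher capabilities score.
    if direction.sum() < 0:
        direction = -direction
    explained = float(s[0] ** 2 / np.sum(s ** 2))
    return z @ direction, explained

def spearman(x, y):
    """Spearman rank correlation (assumes no tied values)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rx, ry)[0, 1])

# Synthetic data: a hidden "general capability" drives four benchmarks.
rng = np.random.default_rng(0)
n_models = 30
latent = rng.normal(size=n_models)
capability_scores = np.column_stack(
    [w * latent + rng.normal(scale=0.3, size=n_models) for w in (1.0, 0.8, 1.2, 0.9)]
)
cap_score, explained = capabilities_component(capability_scores)

# Two hypothetical safety benchmarks: one tracks capability, one does not.
entangled_safety = 0.9 * latent + rng.normal(scale=0.3, size=n_models)
independent_safety = rng.normal(size=n_models)

print("variance explained by first component:", round(explained, 3))
print("entangled benchmark correlation:", round(spearman(cap_score, entangled_safety), 3))
print("independent benchmark correlation:", round(spearman(cap_score, independent_safety), 3))
```

Under this reading, a safety benchmark whose correlation with the capabilities score is close to 1 mostly re-measures general capability, while a benchmark with a correlation near 0 tracks something that scale alone does not deliver.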


Cite

Text

Ren et al. "Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?" Neural Information Processing Systems, 2024. doi:10.52202/079017-2190

Markdown

[Ren et al. "Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?" Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/ren2024neurips-safetywashing/) doi:10.52202/079017-2190

BibTeX

@inproceedings{ren2024neurips-safetywashing,
  title     = {{Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?}},
  author    = {Ren, Richard and Basart, Steven and Khoja, Adam and Pan, Alexander and Gatti, Alice and Phan, Long and Yin, Xuwang and Mazeika, Mantas and Mukobi, Gabriel and Kim, Ryan Hwang and Fitz, Stephen and Hendrycks, Dan},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-2190},
  url       = {https://mlanthology.org/neurips/2024/ren2024neurips-safetywashing/}
}