Characteristics of Harmful Text: Towards Rigorous Benchmarking of Language Models

Abstract

Large language models produce human-like text that drives a growing number of applications. However, recent literature and, increasingly, real-world observations have demonstrated that these models can generate language that is toxic, biased, untruthful, or otherwise harmful. Though work to evaluate language model harms is underway, translating foresight about which harms may arise into rigorous benchmarks is not straightforward. To facilitate this translation, we outline six ways of characterizing harmful text that merit explicit consideration when designing new benchmarks. We then use these characteristics as a lens to identify trends and gaps in existing benchmarks. Finally, we apply them in a case study of the Perspective API, a toxicity classifier that is widely used in harm benchmarks. Our characteristics provide one piece of the bridge that translates between foresight and effective evaluation.
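
For context on the case study, the sketch below shows how harm benchmarks commonly score generated text with the Perspective API: a request asking for the TOXICITY attribute returns a probability that a reader would perceive the text as toxic. This is a minimal illustration using the public v1alpha1 REST endpoint, not the paper's evaluation pipeline; the API key is a placeholder, and the helper name is our own.

import requests

# Placeholder: Perspective API requires a Google Cloud API key.
API_KEY = "YOUR_API_KEY"
ENDPOINT = (
    "https://commentanalyzer.googleapis.com/v1alpha1/"
    f"comments:analyze?key={API_KEY}"
)

def toxicity_score(text: str) -> float:
    """Return the Perspective API TOXICITY score for `text`."""
    payload = {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
    }
    response = requests.post(ENDPOINT, json=payload, timeout=10)
    response.raise_for_status()
    body = response.json()
    # summaryScore.value is a probability in [0, 1] that a reader
    # would perceive the comment as toxic.
    return body["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

if __name__ == "__main__":
    print(f"Toxicity: {toxicity_score('You are a wonderful person.'):.3f}")

Benchmarks typically threshold this score (e.g., flagging generations above 0.5) or report its distribution over many model samples; the paper's characteristics are a lens for asking what such a single scalar does and does not capture about harm.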

Cite

Text

Rauh et al. "Characteristics of Harmful Text: Towards Rigorous Benchmarking of Language Models." Neural Information Processing Systems, 2022.

Markdown

[Rauh et al. "Characteristics of Harmful Text: Towards Rigorous Benchmarking of Language Models." Neural Information Processing Systems, 2022.](https://mlanthology.org/neurips/2022/rauh2022neurips-characteristics/)

BibTeX

@inproceedings{rauh2022neurips-characteristics,
  title     = {{Characteristics of Harmful Text: Towards Rigorous Benchmarking of Language Models}},
  author    = {Rauh, Maribeth and Mellor, John and Uesato, Jonathan and Huang, Po-Sen and Welbl, Johannes and Weidinger, Laura and Dathathri, Sumanth and Glaese, Amelia and Irving, Geoffrey and Gabriel, Iason and Isaac, William and Hendricks, Lisa Anne},
  booktitle = {Neural Information Processing Systems},
  year      = {2022},
  url       = {https://mlanthology.org/neurips/2022/rauh2022neurips-characteristics/}
}