Aligning AI with Shared Human Values

Hendrycks, Dan; Burns, Collin; Basart, Steven; Critch, Andrew; Li, Jerry; Song, Dawn; Steinhardt, Jacob

ICLR 2021

Abstract

We show how to assess a language model's knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgements, a capability that may enable us to steer chatbot outputs or eventually regularize open-ended reinforcement learning agents. With the ETHICS dataset, we find that current language models have a promising but incomplete ability to predict basic human ethical judgements. Our work shows that progress can be made on machine ethics today, and it provides a steppingstone toward AI that is aligned with human values.

Cite

Text

Hendrycks et al. "Aligning AI with Shared Human Values." International Conference on Learning Representations, 2021.

Markdown

[Hendrycks et al. "Aligning AI with Shared Human Values." International Conference on Learning Representations, 2021.](https://mlanthology.org/iclr/2021/hendrycks2021iclr-aligning/)

BibTeX

@inproceedings{hendrycks2021iclr-aligning,
  title     = {{Aligning AI with Shared Human Values}},
  author    = {Hendrycks, Dan and Burns, Collin and Basart, Steven and Critch, Andrew and Li, Jerry and Song, Dawn and Steinhardt, Jacob},
  booktitle = {International Conference on Learning Representations},
  year      = {2021},
  url       = {https://mlanthology.org/iclr/2021/hendrycks2021iclr-aligning/}
}