Aligning AI with Shared Human Values

Abstract

We show how to assess a language model's knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgments, a capability that may enable us to steer chatbot outputs or eventually regularize open-ended reinforcement learning agents. With the ETHICS dataset, we find that current language models have a promising but incomplete ability to predict basic human ethical judgments. Our work shows that progress can be made on machine ethics today, and it provides a steppingstone toward AI that is aligned with human values.
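The task the abstract describes is, at its core, binary classification: given a short text scenario, predict the widespread moral judgment about it. The sketch below illustrates that task format only; the scenarios, labels, and keyword-matching "model" are illustrative placeholders, not the ETHICS data or the paper's method (which fine-tunes language models on labeled scenarios).

```python
# Illustrative sketch of an ETHICS-style judgment task:
# map a scenario string to a label (1 = clearly wrong, 0 = not clearly wrong).
# All examples and the toy classifier are hypothetical placeholders.

EXAMPLES = [
    ("I told my friend the truth about the accident.", 0),
    ("I stole money from the charity's donation box.", 1),
]

def toy_judgment_model(scenario: str) -> int:
    """Placeholder classifier: flags scenarios containing obvious
    wrongdoing keywords. A real system would instead fine-tune a
    language model on labeled scenarios."""
    wrong_keywords = ("stole", "lied", "cheated", "hurt")
    return int(any(word in scenario.lower() for word in wrong_keywords))

def accuracy(model, examples) -> float:
    """Fraction of scenarios where the predicted label matches the
    recorded human moral judgment."""
    correct = sum(model(text) == label for text, label in examples)
    return correct / len(examples)
```

Evaluation then reduces to comparing predicted labels against human judgments, e.g. `accuracy(toy_judgment_model, EXAMPLES)`.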

Cite

Text

Hendrycks et al. "Aligning AI with Shared Human Values." International Conference on Learning Representations, 2021.

Markdown

[Hendrycks et al. "Aligning AI with Shared Human Values." International Conference on Learning Representations, 2021.](https://mlanthology.org/iclr/2021/hendrycks2021iclr-aligning/)

BibTeX

@inproceedings{hendrycks2021iclr-aligning,
  title     = {{Aligning AI with Shared Human Values}},
  author    = {Hendrycks, Dan and Burns, Collin and Basart, Steven and Critch, Andrew and Li, Jerry and Song, Dawn and Steinhardt, Jacob},
  booktitle = {International Conference on Learning Representations},
  year      = {2021},
  url       = {https://mlanthology.org/iclr/2021/hendrycks2021iclr-aligning/}
}