Second Thoughts Are Best: Learning to Re-Align with Human Values from Text Edits

Abstract

We present Second Thoughts, a new learning paradigm that enables language models (LMs) to re-align with human values. Second Thoughts models the chain of edits between value-unaligned and value-aligned text by fine-tuning an LM and then refining it with reinforcement learning. It not only achieves superior performance on three value-alignment benchmark datasets but also shows strong human-value transfer learning ability in few-shot scenarios. The generated editing steps also offer better interpretability and make interactive error correction easier. Extensive human evaluations further confirm its effectiveness.
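To make the idea of an edit chain between unaligned and aligned text concrete, here is a minimal sketch. It assumes token-level insert, delete, and replace steps derived with Python's standard `difflib`; the abstract does not say how the edits are inferred, so this is an illustration and not the authors' procedure. The function names `edit_chain` and `apply_chain` and the example sentences are hypothetical.

```python
from difflib import SequenceMatcher


def edit_chain(source: str, target: str):
    """Return token-level edit steps that turn `source` into `target`.

    Each step is (tag, start, end, new_tokens), where tag is one of
    "replace", "delete", or "insert" and [start, end) indexes the source tokens.
    Assumption: whitespace tokenization and difflib alignment, for illustration only.
    """
    src, tgt = source.split(), target.split()
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    steps = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged spans are not edits
        steps.append((tag, i1, i2, tgt[j1:j2]))
    return steps


def apply_chain(source: str, steps) -> str:
    """Apply the edit steps to `source` and return the edited text."""
    tokens = source.split()
    # Apply right-to-left so earlier source indices stay valid.
    for _tag, start, end, new_tokens in reversed(steps):
        tokens[start:end] = new_tokens
    return " ".join(tokens)


# Hypothetical example pair (not taken from the paper's datasets).
unaligned = "I would lie to my friend to win"
aligned = "I would be honest with my friend"

chain = edit_chain(unaligned, aligned)
for step in chain:
    print(step)
print(apply_chain(unaligned, chain))
```

Because each step is explicit, a reader can inspect or override an individual edit, which is the kind of interpretability and interactive correction the abstract describes.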

Cite

Text

Liu et al. "Second Thoughts Are Best: Learning to Re-Align with Human Values from Text Edits." Neural Information Processing Systems, 2022.

Markdown

[Liu et al. "Second Thoughts Are Best: Learning to Re-Align with Human Values from Text Edits." Neural Information Processing Systems, 2022.](https://mlanthology.org/neurips/2022/liu2022neurips-second/)

BibTeX

@inproceedings{liu2022neurips-second,
  title     = {{Second Thoughts Are Best: Learning to Re-Align with Human Values from Text Edits}},
  author    = {Liu, Ruibo and Jia, Chenyan and Zhang, Ge and Zhuang, Ziyu and Liu, Tony and Vosoughi, Soroush},
  booktitle = {Neural Information Processing Systems},
  year      = {2022},
  url       = {https://mlanthology.org/neurips/2022/liu2022neurips-second/}
}