SEAL: Systematic Error Analysis for Value ALignment

Abstract

Reinforcement Learning from Human Feedback (RLHF) aligns language models (LMs) with human values by training reward models (RMs) on binary preferences and using these RMs to fine-tune the base models. Despite its importance, the internal mechanisms of RLHF remain poorly understood. This paper introduces new metrics to evaluate RM effectiveness, focusing on feature imprint, alignment resistance, and alignment robustness. We categorize alignment datasets into target features (desired values) and spoiler features (undesired concepts). By regressing RM scores against these features, we quantify the extent to which RMs reward them -- feature imprint. We define alignment resistance as the proportion of the preference dataset where RMs fail to match human preferences, and we assess alignment robustness by analyzing RM responses to slightly perturbed texts. Our experiments, using open-source components like the Anthropic/hh-rlhf preference dataset and OpenAssistant RMs, reveal significant imprints of target features and a notable sensitivity to spoiler features. We observe a 26% resistance incidence in portions of the dataset where LM labelers disagreed with human preferences. We also find that misalignment stems from confusing entries in the alignment dataset. These findings underscore the importance of scrutinizing both RMs and alignment datasets for a deeper understanding of value alignment.
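As a rough illustration of the three metrics named in the abstract, the following Python sketch computes each one on toy inputs. It is not the authors' implementation. The function names, the per-feature univariate regression (the abstract does not specify the regression form), the treatment of score ties as mismatches, and the "preference flip rate" used for robustness are all assumptions made for illustration.

```python
def feature_imprint(feature, rm_scores):
    """Slope of a univariate least-squares fit of RM scores on one feature.

    Assumption: `feature` is a 0/1 indicator per entry; for a binary
    feature the slope equals the difference in mean RM score between
    entries with and without the feature.
    """
    n = len(feature)
    mean_x = sum(feature) / n
    mean_y = sum(rm_scores) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(feature, rm_scores))
    var = sum((x - mean_x) ** 2 for x in feature)
    if var == 0:
        raise ValueError("feature is constant; slope is undefined")
    return cov / var


def alignment_resistance(chosen_scores, rejected_scores):
    """Fraction of preference pairs where the RM does not score the
    human-chosen response strictly higher (ties count as mismatches,
    which is an assumption of this sketch)."""
    pairs = list(zip(chosen_scores, rejected_scores))
    mismatches = sum(1 for c, r in pairs if c <= r)
    return mismatches / len(pairs)


def preference_flip_rate(orig_chosen, orig_rejected, pert_chosen, pert_rejected):
    """Hypothetical robustness proxy: fraction of pairs whose RM
    preference changes after the texts are slightly perturbed."""
    before = [c > r for c, r in zip(orig_chosen, orig_rejected)]
    after = [c > r for c, r in zip(pert_chosen, pert_rejected)]
    flips = sum(1 for b, a in zip(before, after) if b != a)
    return flips / len(before)
```

In this sketch, a large positive `feature_imprint` for a target feature would indicate that the RM rewards it, while a large value for a spoiler feature would indicate unwanted sensitivity. The RM scores themselves would come from whichever reward model and preference dataset are under study.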

Cite

Text

Revel et al. "SEAL: Systematic Error Analysis for Value ALignment." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I26.34973

Markdown

[Revel et al. "SEAL: Systematic Error Analysis for Value ALignment." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/revel2025aaai-seal/) doi:10.1609/AAAI.V39I26.34973

BibTeX

@inproceedings{revel2025aaai-seal,
  title     = {{SEAL: Systematic Error Analysis for Value ALignment}},
  author    = {Revel, Manon and Cargnelutti, Matteo and Eloundou, Tyna and Leppert, Greg},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {27599--27607},
  doi       = {10.1609/AAAI.V39I26.34973},
  url       = {https://mlanthology.org/aaai/2025/revel2025aaai-seal/}
}