An Auditing Test to Detect Behavioral Shift in Language Models

Abstract

Ensuring that language models (LMs) align with societal values has become paramount as LMs continue to achieve near-human performance across various tasks. In this work, we address the problem of a vendor deploying an unaligned model to consumers. For instance, an unscrupulous vendor may wish to deploy an unaligned model if doing so increases overall profit. Alternatively, an attacker may compromise a vendor and modify its model to produce unintended behavior. In these cases, an external auditing process can fail: if the vendor or attacker knows the model is being audited, they can swap in an aligned model for the evaluation and swap it back out once the evaluation is complete. To address this, we propose a regulatory framework involving a continuous, online auditing process to ensure that deployed models remain aligned throughout their life cycle. We give theoretical guarantees that, with access to an aligned model, one can detect an unaligned model via this process solely from model generations, given enough samples. Because the test needs only generations, a regulator can impersonate a consumer, preventing the vendor or attacker from surreptitiously swapping in an aligned model during evaluation. We hope that this work extends the discourse on AI alignment via regulatory practices and encourages additional solutions for consumer rights protection for LMs.

Cite

Text

Richter et al. "An Auditing Test to Detect Behavioral Shift in Language Models." ICML 2024 Workshops: FM-Wild, 2024.

Markdown

[Richter et al. "An Auditing Test to Detect Behavioral Shift in Language Models." ICML 2024 Workshops: FM-Wild, 2024.](https://mlanthology.org/icmlw/2024/richter2024icmlw-auditing/)

BibTeX

@inproceedings{richter2024icmlw-auditing,
  title     = {{An Auditing Test to Detect Behavioral Shift in Language Models}},
  author    = {Richter, Leo and Agrawal, Nitin and He, Xuanli and Minervini, Pasquale and Kusner, Matt},
  booktitle = {ICML 2024 Workshops: FM-Wild},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/richter2024icmlw-auditing/}
}