Actions Speak Louder than Words: Superficial Fairness Alignment in LLMs

Abstract

Large language models (LLMs) are increasingly used to examine tabular datasets and aid decision-making in critical sectors such as clinical medicine. Standard fairness metrics, which were largely designed to evaluate supervised learning models, are not well suited to this setting. This paper proposes a novel dichotomy between *intrinsic* and *behavioral* fairness, and details a comprehensive framework for evaluating both in LLMs. The former is encoded in a language model's embeddings through procedures such as pre-training and preference fine-tuning. The latter reflects how LLMs behave when applied in real-world scenarios. Although current work largely prioritizes intrinsic over behavioral fairness, we argue that the latter is far more important in practice. We illustrate the gap between these two concepts in a series of experiments on a semi-synthetic dataset inspired by a large-scale study of racial bias in health algorithms. Our results suggest a new direction for fairness research in LLMs, as well as practical guidelines for mitigating harmful outcomes.
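
To make the intrinsic/behavioral distinction concrete, below is a minimal sketch of how a behavioral fairness audit might look: the LLM is prompted with rows of a tabular dataset and its decision rates are compared across groups, in the spirit of demographic parity. This is an illustrative assumption, not the paper's actual framework; the prompt wording, the `query_llm` helper, and the field names are hypothetical placeholders.

```python
from collections import defaultdict

def query_llm(prompt: str) -> str:
    """Hypothetical wrapper around an LLM client; expected to return 'yes' or 'no'."""
    raise NotImplementedError("plug in an actual LLM client here")

def behavioral_demographic_parity(rows, group_key="race"):
    """Estimate per-group positive-decision rates from LLM outputs on tabular rows.

    `rows` is a list of dicts describing individuals (e.g., patient records).
    Returns the per-group rates and the max-min gap, a simple behavioral
    (un)fairness score analogous to a demographic parity difference.
    """
    counts, positives = defaultdict(int), defaultdict(int)
    for row in rows:
        # Serialize every feature except the protected attribute into the prompt.
        features = ", ".join(f"{k}={v}" for k, v in row.items() if k != group_key)
        prompt = (
            "Based on the following patient record, should this patient be "
            f"enrolled in a high-risk care program? Answer yes or no.\n{features}"
        )
        decision = query_llm(prompt).strip().lower().startswith("yes")
        counts[row[group_key]] += 1
        positives[row[group_key]] += int(decision)
    rates = {g: positives[g] / counts[g] for g in counts}
    return rates, max(rates.values()) - min(rates.values())
```

An intrinsic audit, by contrast, would probe the model's embeddings or its stated attitudes directly; the paper's point is that the two can diverge, so decision-level checks like the one sketched here are needed in deployment.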

Cite

Text

Wei et al. "Actions Speak Louder than Words: Superficial Fairness Alignment in LLMs." ICLR 2024 Workshops: R2-FM, 2024.

Markdown

[Wei et al. "Actions Speak Louder than Words: Superficial Fairness Alignment in LLMs." ICLR 2024 Workshops: R2-FM, 2024.](https://mlanthology.org/iclrw/2024/wei2024iclrw-actions/)

BibTeX

@inproceedings{wei2024iclrw-actions,
  title     = {{Actions Speak Louder than Words: Superficial Fairness Alignment in LLMs}},
  author    = {Wei, Qiyao and Chan, Alex James and Goetz, Lea and Watson, David and van der Schaar, Mihaela},
  booktitle = {ICLR 2024 Workshops: R2-FM},
  year      = {2024},
  url       = {https://mlanthology.org/iclrw/2024/wei2024iclrw-actions/}
}