Collapsed Language Models Promote Fairness

Abstract

To mitigate societal biases implicitly encoded in recent successful pretrained language models, a diverse array of approaches has been proposed to encourage model fairness, focusing on prompting, data augmentation, regularized fine-tuning, and more. Despite these developments, it remains nontrivial to reach a principled understanding of fairness or an effective algorithm that can consistently debias language models. In this work, through rigorous evaluations of Neural Collapse -- a learning phenomenon that emerges in the last-layer representations and classifiers of deep networks -- on fairness-related words, we find that debiased language models exhibit collapsed alignment between token representations and word embeddings. More importantly, this observation inspires us to design a principled fine-tuning method that can effectively improve fairness across a wide range of debiasing methods, while still preserving the performance of language models on standard natural language understanding tasks. Our code is available at https://github.com/Xujxyang/Fairness-NC-main.
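The repository linked above is the authoritative implementation. As a minimal, hypothetical sketch of the kind of measurement the abstract describes, one could quantify alignment between token representations and word embeddings with an NC3-style (self-duality) score: the average cosine similarity between centered, normalized mean representations of fairness-related words and the corresponding rows of the output word-embedding matrix, with values near 1 indicating collapse. The function name and usage below are illustrative assumptions, not taken from the paper.

import numpy as np

def nc3_alignment(class_means, word_embeddings):
    # class_means:     (K, d) mean last-layer representation per word
    # word_embeddings: (K, d) matching rows of the output embedding matrix
    # Center and unit-normalize the class means.
    M = class_means - class_means.mean(axis=0, keepdims=True)
    M = M / np.linalg.norm(M, axis=1, keepdims=True)
    # Unit-normalize the word-embedding (classifier) rows.
    W = word_embeddings / np.linalg.norm(word_embeddings, axis=1, keepdims=True)
    # Average cosine similarity between matched pairs; near 1 means collapsed alignment.
    return float(np.mean(np.sum(M * W, axis=1)))

# Hypothetical usage: reps[i] is the mean representation of fairness-related
# word i over a corpus, emb[i] is its output-embedding row.
# score = nc3_alignment(reps, emb)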

Cite

Text

Xu et al. "Collapsed Language Models Promote Fairness." International Conference on Learning Representations, 2025.

Markdown

[Xu et al. "Collapsed Language Models Promote Fairness." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/xu2025iclr-collapsed/)

BibTeX

@inproceedings{xu2025iclr-collapsed,
  title     = {{Collapsed Language Models Promote Fairness}},
  author    = {Xu, Jingxuan and Chen, Wuyang and Li, Linyi and Zhao, Yao and Wei, Yunchao},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/xu2025iclr-collapsed/}
}