Test-Time Fairness and Robustness in Large Language Models

Cotta, Leonardo; Maddison, Chris J.

TMLR 2025

Abstract

Frontier Large Language Models (LLMs) can be socially discriminatory or sensitive to spurious features of their inputs. Because only well-resourced corporations can train frontier LLMs, we need robust test-time strategies to control such biases. Existing solutions, which instruct the LLM to be fair or robust, rely on the model’s implicit understanding of bias. Causality provides a rich formalism through which we can be explicit about our debiasing requirements. Yet, as we show, a naive application of the standard causal debiasing strategy, counterfactual data augmentation, fails to fulfill individual-level debiasing requirements at test time. To address this, we develop stratified invariance, a flexible debiasing notion that can capture a range of debiasing requirements, from population level to individual level, through an additional measurement that stratifies the predictions. We develop a complete test for this notion and introduce a data augmentation strategy that guarantees stratified invariance at test time under suitable assumptions, together with a prompting strategy that encourages stratified invariance in LLMs. We show that our prompting strategy, unlike implicit instructions, consistently reduces the bias of frontier LLMs across a suite of synthetic and real-world benchmarks without requiring additional data, finetuning, or pre-training.

Cite

Text

Cotta and Maddison. "Test-Time Fairness and Robustness in Large Language Models." Transactions on Machine Learning Research, 2025.

Markdown

[Cotta and Maddison. "Test-Time Fairness and Robustness in Large Language Models." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/cotta2025tmlr-testtime/)

BibTeX

@article{cotta2025tmlr-testtime,
  title     = {{Test-Time Fairness and Robustness in Large Language Models}},
  author    = {Cotta, Leonardo and Maddison, Chris J.},
  journal   = {Transactions on Machine Learning Research},
  year      = {2025},
  url       = {https://mlanthology.org/tmlr/2025/cotta2025tmlr-testtime/}
}