Preference Optimization for Concept Bottleneck Models

Abstract

Concept Bottleneck Models (CBMs) propose to enhance the trustworthiness of AI systems by grounding their decisions in a set of human-understandable concepts. However, CBMs typically assume that datasets contain accurate concept labels, an assumption often violated in practice which, as we show, can significantly degrade performance (by 25% in some cases). To address this, we introduce the Concept Preference Optimization (CPO) objective, a new loss function based on Direct Preference Optimization that effectively mitigates the negative impact of concept mislabeling on CBM performance. We analyze key properties of the CPO objective, showing that it directly optimizes the concepts' posterior distribution, and contrast it with Binary Cross-Entropy (BCE), showing that CPO is inherently less sensitive to concept noise. We empirically confirm our analysis, finding that CPO consistently outperforms BCE on three real-world datasets, both with and without added label noise.
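
For intuition, here is a minimal PyTorch sketch of how a DPO-style preference objective can be applied to binary concept predictions, treating each annotated concept label as "preferred" over its flipped counterpart relative to a frozen reference model. The name cpo_loss, its arguments, and these modeling choices are illustrative assumptions for exposition, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def cpo_loss(logits, ref_logits, concept_labels, beta=1.0):
    """DPO-style pairwise loss over binary concept predictions (illustrative sketch).

    logits         : (batch, n_concepts) concept logits from the model being trained
    ref_logits     : (batch, n_concepts) concept logits from a frozen reference model
    concept_labels : (batch, n_concepts) annotated (possibly noisy) 0/1 labels
    """
    y = concept_labels.float()

    # Log-probability of the annotated ("preferred") label and of its flipped counterpart
    logp      = F.logsigmoid(logits) * y + F.logsigmoid(-logits) * (1 - y)
    logp_flip = F.logsigmoid(-logits) * y + F.logsigmoid(logits) * (1 - y)

    # Same quantities under the frozen reference model
    ref_logp      = F.logsigmoid(ref_logits) * y + F.logsigmoid(-ref_logits) * (1 - y)
    ref_logp_flip = F.logsigmoid(-ref_logits) * y + F.logsigmoid(ref_logits) * (1 - y)

    # Bradley-Terry / DPO margin: preferred minus dispreferred, offset by the reference
    margin = beta * ((logp - logp_flip) - (ref_logp - ref_logp_flip))
    return -F.logsigmoid(margin).mean()

# Example usage with random tensors: 4 samples, 5 concepts
logits = torch.randn(4, 5)
ref_logits = torch.randn(4, 5)
labels = torch.randint(0, 2, (4, 5))
loss = cpo_loss(logits, ref_logits, labels)

Because -logsigmoid saturates once the margin for a concept is large, a single mislabeled concept contributes a bounded gradient under this kind of objective, which is one intuition for why a preference-style loss can be less sensitive to label noise than a per-label BCE fit.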

Cite

Text

Penaloza et al. "Preference Optimization for Concept Bottleneck Models." ICLR 2025 Workshops: Bi-Align, 2025.

Markdown

[Penaloza et al. "Preference Optimization for Concept Bottleneck Models." ICLR 2025 Workshops: Bi-Align, 2025.](https://mlanthology.org/iclrw/2025/penaloza2025iclrw-preference/)

BibTeX

@inproceedings{penaloza2025iclrw-preference,
  title     = {{Preference Optimization for Concept Bottleneck Models}},
  author    = {Penaloza, Emiliano and Zhang, Tianyue H. and Charlin, Laurent and Zarlenga, Mateo Espinosa},
  booktitle = {ICLR 2025 Workshops: Bi-Align},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/penaloza2025iclrw-preference/}
}