Interpreting Pretrained Language Models via Concept Bottlenecks (Extended Abstract)
Abstract
Pretrained language models (PLMs) achieve state-of-the-art results but often function as "black boxes", hindering interpretability and responsible deployment. While methods like attention analysis exist, they often lack clarity and intuitiveness. We propose interpreting PLMs through high-level, human-understandable concepts using Concept Bottleneck Models (CBMs). This extended abstract introduces C3M (ChatGPT-guided Concept augmentation with Concept-level Mixup), a novel framework for training Concept-Bottleneck-Enabled PLMs (CBE-PLMs). C3M leverages Large Language Models (LLMs) like ChatGPT to augment concept sets and generate noisy concept labels, combined with a concept-level MixUp mechanism to enhance robustness and effectively learn from both human-annotated and machine-generated concepts. Empirical results show our approach provides intuitive explanations, aids model diagnosis via test-time intervention, and improves the interpretability-utility trade-off, even with limited or noisy concept annotations. This is a concise version of [Tan et al., 2024b], recipient of the Best Paper Award at PAKDD 2024. Code and data are released at https://github.com/Zhen-Tan-dmml/CBM_NLP.git.
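To make the concept-level MixUp mechanism concrete, the sketch below interpolates two concept-label vectors (e.g., one human-annotated, one machine-generated) with a Beta-sampled coefficient, as in standard MixUp but applied at the concept layer. The function name, signature, and Beta prior are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def concept_mixup(c_a, c_b, alpha=0.2, rng=None):
    """Convexly mix two concept-label vectors.

    Illustrative sketch of concept-level MixUp: the interpolation
    operates on concept annotations rather than raw tokens, letting
    the model learn jointly from clean and noisy concept labels.
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)  # mixing coefficient in [0, 1]
    c_a, c_b = np.asarray(c_a, float), np.asarray(c_b, float)
    return lam * c_a + (1.0 - lam) * c_b, lam
```

In training, the same coefficient would also be applied to the corresponding input representations and task labels, so the mixed concept vector stays consistent with its mixed example.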
Cite
Text
Tan et al. "Interpreting Pretrained Language Models via Concept Bottlenecks (Extended Abstract)." International Joint Conference on Artificial Intelligence, 2025. doi:10.24963/IJCAI.2025/1221
Markdown
[Tan et al. "Interpreting Pretrained Language Models via Concept Bottlenecks (Extended Abstract)." International Joint Conference on Artificial Intelligence, 2025.](https://mlanthology.org/ijcai/2025/tan2025ijcai-interpreting/) doi:10.24963/IJCAI.2025/1221
BibTeX
@inproceedings{tan2025ijcai-interpreting,
title = {{Interpreting Pretrained Language Models via Concept Bottlenecks (Extended Abstract)}},
author = {Tan, Zhen and Cheng, Lu and Wang, Song and Yuan, Bo and Li, Jundong and Liu, Huan},
booktitle = {International Joint Conference on Artificial Intelligence},
year = {2025},
pages = {10942-10946},
doi = {10.24963/IJCAI.2025/1221},
url = {https://mlanthology.org/ijcai/2025/tan2025ijcai-interpreting/}
}