Multimodal Deception in Explainable AI: Concept-Level Backdoor Attacks on Concept Bottleneck Models

Abstract

Deep learning has demonstrated transformative potential across domains, yet its inherent opacity has driven the development of Explainable Artificial Intelligence (XAI). Concept Bottleneck Models (CBMs), which enforce interpretability through human-understandable concepts, represent a prominent advancement in XAI. However, despite their semantic transparency, CBMs remain vulnerable to security threats such as backdoor attacks: malicious manipulations that induce controlled misbehaviors during inference. While CBMs leverage multimodal representations (visual inputs and textual concepts) to enhance interpretability, their dual-modality structure introduces unique, unexplored attack surfaces. To address this risk, we propose CAT (Concept-level Backdoor ATtacks), a methodology that injects stealthy triggers into conceptual representations during training. Unlike naive attacks that randomly corrupt concepts, CAT employs a filtering mechanism to enable precise prediction manipulation without compromising clean-data performance. We further propose CAT+, an enhanced variant incorporating a concept correlation function to iteratively optimize trigger-concept associations, thereby maximizing attack effectiveness and stealthiness. We validate our approach through a two-stage evaluation framework. First, we establish the fundamental vulnerability of the concept bottleneck layer in a controlled setting, showing that CAT+ achieves high attack success rates (ASR) while remaining statistically indistinguishable from natural data. Second, we demonstrate practical end-to-end feasibility via our proposed Image2Trigger_c method, which translates visual perturbations into concept-level triggers, achieving an end-to-end ASR of 53.29%. Extensive experiments show that CAT significantly outperforms random-selection baselines, and that standard defenses such as Neural Cleanse fail to detect these semantic attacks. This work highlights critical security risks in interpretable AI systems and provides a methodology for future security assessments of CBMs.
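To make the threat model concrete, the following is a minimal sketch of generic concept-level poisoning and of how an attack success rate can be measured. It is not the authors' CAT or CAT+ procedure: the filtering mechanism and the concept correlation function are not specified in this abstract and are not reproduced here. The function names, the dictionary trigger format, and the use of binary concept vectors are all assumptions made for illustration.

```python
import random


def poison_concepts(concepts, labels, trigger, target_class, rate, seed=0):
    """Return poisoned copies of a concept-level training set.

    concepts: list of binary concept vectors (lists of 0/1).
    trigger:  dict {concept_index: forced_value}; a hypothetical trigger format.
    rate:     fraction of non-target samples to poison.
    """
    rng = random.Random(seed)
    # Only samples outside the target class are candidates for poisoning.
    candidates = [i for i, y in enumerate(labels) if y != target_class]
    chosen = set(rng.sample(candidates, int(rate * len(candidates))))

    new_concepts, new_labels = [], []
    for i, (vec, y) in enumerate(zip(concepts, labels)):
        vec = list(vec)  # copy so the clean data stays untouched
        if i in chosen:
            for idx, val in trigger.items():
                vec[idx] = val  # stamp the trigger into the concept vector
            y = target_class    # relabel to the attacker's target
        new_concepts.append(vec)
        new_labels.append(y)
    return new_concepts, new_labels, sorted(chosen)


def attack_success_rate(predict, concepts, labels, trigger, target_class):
    """Fraction of non-target samples classified as target once triggered."""
    hits = total = 0
    for vec, y in zip(concepts, labels):
        if y == target_class:
            continue
        vec = list(vec)
        for idx, val in trigger.items():
            vec[idx] = val
        total += 1
        hits += int(predict(vec) == target_class)
    return hits / total if total else 0.0
```

Here `predict` stands for any concept-to-label classifier trained on the poisoned set. A stealthy attack in the sense described above would also need to leave accuracy on clean, untriggered concept vectors essentially unchanged, which this sketch does not check.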

Cite

Text

Lai et al. "Multimodal Deception in Explainable AI: Concept-Level Backdoor Attacks on Concept Bottleneck Models." Transactions on Machine Learning Research, 2026.

Markdown

[Lai et al. "Multimodal Deception in Explainable AI: Concept-Level Backdoor Attacks on Concept Bottleneck Models." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/lai2026tmlr-multimodal/)

BibTeX

@article{lai2026tmlr-multimodal,
  title     = {{Multimodal Deception in Explainable AI: Concept-Level Backdoor Attacks on Concept Bottleneck Models}},
  author    = {Lai, Songning and Yang, Jiayu and Huang, Yu and Hu, Lijie and Xue, Tianlang and Hu, Zhangyi and Li, Jiaxu and Liao, Haicheng and Liu, Zongyang and Yue, Yutao},
  journal   = {Transactions on Machine Learning Research},
  year      = {2026},
  url       = {https://mlanthology.org/tmlr/2026/lai2026tmlr-multimodal/}
}