Constitutional Classifiers++: Efficient Production-Grade Defenses Against Universal Jailbreaks

Cunningham, Hoagy; Wei, Jerry; Wang, Zihan; Persic, Andrew; Peng, Alwin; Abderrachid, Jordan; Agarwal, Raj; Chen, Bobby; Dau, Andy; Dimitriev, Alek; Howard, Logan; Hua, Yijin; Gilson, Rob; Lin, Mu; Liu, Christopher; Mikulik, Vladimir; Mittapalli, Rohit; O'Hara, Clare; Pan, Jin; Saxena, Nikhil; Silverstein, Alex; Song, Yue; Zhou, Giulio; Leike, Jan; Kaplan, Jared; Perez, Ethan; Sharma, Mrinank

ICLR 2026

Abstract

We introduce enhanced Constitutional Classifiers that deliver production-grade jailbreak robustness with dramatically reduced computational costs and refusal rates compared to previous-generation defenses. Our system combines several key insights. First, we develop exchange classifiers that evaluate model responses in their full conversational context, which addresses vulnerabilities in last-generation systems that examine outputs in isolation. Second, we implement a two-stage classifier cascade where lightweight classifiers screen all traffic and escalate only suspicious exchanges to more expensive classifiers. Third, we train efficient linear probe classifiers and ensemble them with external classifiers to simultaneously improve robustness and reduce computational costs. Together, these techniques yield a production-grade system achieving a 40x computational cost reduction compared to our baseline exchange classifier, while maintaining a 0.05% refusal rate on production traffic. Through extensive red-teaming comprising over 1,700 hours, we demonstrate strong protection against universal jailbreaks: no attack on this system successfully elicited responses to all eight target queries comparable in detail to an undefended model. Our work establishes Constitutional Classifiers as practical and efficient safeguards for large language models.
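The two-stage cascade described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Cascade` class, the `Scorer` type, both thresholds, the toy scorers, and the cost figures are hypothetical and chosen only to show the control flow, in which a cheap scorer sees every exchange (prompt plus response) and only exchanges it flags reach the expensive scorer.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical scorer type: maps a full exchange (prompt, response)
# to a harm score in [0, 1]. Not taken from the paper.
Scorer = Callable[[str, str], float]


@dataclass
class Cascade:
    cheap: Scorer               # stage 1: runs on all traffic
    expensive: Scorer           # stage 2: runs only on escalated exchanges
    escalate_at: float = 0.2    # assumed escalation threshold
    block_at: float = 0.5       # assumed blocking threshold

    def should_block(self, prompt: str, response: str) -> bool:
        # Stage 1: the cheap scorer evaluates the response in context.
        if self.cheap(prompt, response) < self.escalate_at:
            return False
        # Stage 2: only suspicious exchanges pay for the expensive scorer.
        return self.expensive(prompt, response) >= self.block_at


def expected_cost(cheap_cost: float, expensive_cost: float,
                  escalation_rate: float) -> float:
    """Average per-exchange cost when a fraction of traffic is escalated."""
    return cheap_cost + escalation_rate * expensive_cost


# Toy keyword scorers standing in for real classifiers (illustrative only).
def toy_cheap(prompt: str, response: str) -> float:
    return 0.9 if "synthesize" in (prompt + " " + response).lower() else 0.0


def toy_expensive(prompt: str, response: str) -> float:
    return 1.0 if "step 1" in response.lower() else 0.0


cascade = Cascade(cheap=toy_cheap, expensive=toy_expensive)
```

With made-up costs of 1 unit for the cheap stage and 20 units for the expensive stage, `expected_cost(1.0, 20.0, 0.1)` gives the average cost when 10% of exchanges are escalated. These numbers are illustrative and are not the figures reported in the paper.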

Cite

Text

Cunningham et al. "Constitutional Classifiers++: Efficient Production-Grade Defenses Against Universal Jailbreaks." International Conference on Learning Representations, 2026.

Markdown

[Cunningham et al. "Constitutional Classifiers++: Efficient Production-Grade Defenses Against Universal Jailbreaks." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/cunningham2026iclr-constitutional/)

BibTeX

@inproceedings{cunningham2026iclr-constitutional,
  title     = {{Constitutional Classifiers++: Efficient Production-Grade Defenses Against Universal Jailbreaks}},
  author    = {Cunningham, Hoagy and Wei, Jerry and Wang, Zihan and Persic, Andrew and Peng, Alwin and Abderrachid, Jordan and Agarwal, Raj and Chen, Bobby and Dau, Andy and Dimitriev, Alek and Howard, Logan and Hua, Yijin and Gilson, Rob and Lin, Mu and Liu, Christopher and Mikulik, Vladimir and Mittapalli, Rohit and O'Hara, Clare and Pan, Jin and Saxena, Nikhil and Silverstein, Alex and Song, Yue and Zhou, Giulio and Leike, Jan and Kaplan, Jared and Perez, Ethan and Sharma, Mrinank},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/cunningham2026iclr-constitutional/}
}