How Effective Is Constitutional AI in Small LLMs? A Study on DeepSeek-R1 and Its Peers

Menke, Antonio-Gabriel Chacón; Tan, Phan Xuan

ICLRW 2025

Abstract

Recent incidents highlight safety risks in Large Language Models (LLMs), motivating research into alignment methods like Constitutional AI (CAI). This paper explores CAI's self-critique mechanism on small, uncensored 7-9B parameter models: DeepSeek-R1-8B, Gemma-2-9B, Llama 3.1-8B, and Qwen2.5-7B. We show that while Llama-based models exhibited significant harm reduction through self-critique, other architectures struggled with harm detection post-abliteration. These findings suggest CAI's effectiveness may vary depending on model architecture and reasoning capabilities.

Cite

Text

Menke and Tan. "How Effective Is Constitutional AI in Small LLMs? a Study on DeepSeek-R1 and Its Peers." ICLR 2025 Workshops: HAIC, 2025.

Markdown

[Menke and Tan. "How Effective Is Constitutional AI in Small LLMs? a Study on DeepSeek-R1 and Its Peers." ICLR 2025 Workshops: HAIC, 2025.](https://mlanthology.org/iclrw/2025/menke2025iclrw-effective/)

BibTeX

@inproceedings{menke2025iclrw-effective,
  title     = {{How Effective Is Constitutional AI in Small LLMs? a Study on DeepSeek-R1 and Its Peers}},
  author    = {Menke, Antonio-Gabriel Chacón and Tan, Phan Xuan},
  booktitle = {ICLR 2025 Workshops: HAIC},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/menke2025iclrw-effective/}
}