How Effective Is Constitutional AI in Small LLMs? A Study on DeepSeek-R1 and Its Peers

Abstract

Recent incidents highlight safety risks in Large Language Models (LLMs), motivating research into alignment methods such as Constitutional AI (CAI). This paper explores CAI's self-critique mechanism on small, uncensored (abliterated) 7-9B-parameter models: DeepSeek-R1-8B, Gemma-2-9B, Llama-3.1-8B, and Qwen2.5-7B. We show that while the Llama-based models exhibited significant harm reduction through self-critique, other architectures struggled to detect harm after abliteration. These findings suggest that CAI's effectiveness may vary with model architecture and reasoning capability.
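
The self-critique mechanism studied here follows the standard CAI critique-and-revision loop: generate a response, critique it against a constitutional principle, then revise conditioned on the critique. A minimal sketch follows, assuming a generic Hugging Face text-generation pipeline; the model name, principle text, and prompt templates are illustrative placeholders, not the authors' actual setup.

# Minimal sketch of the CAI critique-and-revision loop.
# Assumptions: model name, principle, and prompt templates are
# illustrative, not the paper's exact configuration.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
)

PRINCIPLE = "Identify ways the response is harmful, unethical, or dangerous."

def generate(prompt: str, max_new_tokens: int = 256) -> str:
    out = generator(prompt, max_new_tokens=max_new_tokens, return_full_text=False)
    return out[0]["generated_text"]

def cai_revise(user_prompt: str) -> str:
    # 1. Initial (possibly unsafe) response from the model.
    response = generate(user_prompt)
    # 2. Self-critique of that response against the principle.
    critique = generate(
        f"Prompt: {user_prompt}\nResponse: {response}\n"
        f"Critique request: {PRINCIPLE}\nCritique:"
    )
    # 3. Revision conditioned on the self-critique.
    revision = generate(
        f"Prompt: {user_prompt}\nResponse: {response}\n"
        f"Critique: {critique}\n"
        "Rewrite the response to remove the harms identified above.\nRevision:"
    )
    return revision

The paper's finding maps onto step 2: if a model cannot reliably detect harm in its own output, the revision in step 3 has nothing to act on, which is where the non-Llama abliterated models reportedly fell short.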

Cite

Text

Menke and Tan. "How Effective Is Constitutional AI in Small LLMs? A Study on DeepSeek-R1 and Its Peers." ICLR 2025 Workshops: HAIC, 2025.

Markdown

[Menke and Tan. "How Effective Is Constitutional AI in Small LLMs? A Study on DeepSeek-R1 and Its Peers." ICLR 2025 Workshops: HAIC, 2025.](https://mlanthology.org/iclrw/2025/menke2025iclrw-effective/)

BibTeX

@inproceedings{menke2025iclrw-effective,
  title     = {{How Effective Is Constitutional AI in Small LLMs? A Study on DeepSeek-R1 and Its Peers}},
  author    = {Menke, Antonio-Gabriel Chacón and Tan, Phan Xuan},
  booktitle = {ICLR 2025 Workshops: HAIC},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/menke2025iclrw-effective/}
}