Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning

Abstract

Safety alignment is crucial for Large Language Models (LLMs) to resist malicious instructions but often results in over-refusals, where benign prompts are unnecessarily rejected, impairing user experience and model utility. To this end, we introduce ACTOR (Activation-Based Training for Over-Refusal Reduction), a robust, compute- and data-efficient training framework that minimizes over-refusals by utilizing internal activation patterns from diverse queries. ACTOR precisely identifies and adjusts the activation components that trigger refusals, providing stronger control over the refusal mechanism. By fine-tuning only a single model layer, ACTOR effectively reduces over-refusals across multiple benchmarks while maintaining the model’s ability to handle harmful queries and preserving overall utility.
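The paper does not release its loss or training code in this entry, but the idea of fine-tuning a single layer so that refusal-related activations shift "just enough" can be illustrated with a minimal, hypothetical PyTorch sketch. Everything below is an assumption for illustration: the toy layer stack, the choice of target layer, the random stand-in "prompt activations", and the proxy loss are not the authors' ACTOR objective.

```python
# Hedged sketch (not the authors' code): train only one layer so that hidden
# activations of benign-but-over-refused prompts shift along a direction
# estimated from the difference between compliant and refused activations.
import torch
import torch.nn as nn

torch.manual_seed(0)

D, N_LAYERS, TARGET_LAYER = 64, 4, 2  # hypothetical sizes and layer index

# Stand-in for a transformer stack: simple MLP blocks over pooled embeddings.
layers = nn.ModuleList(
    [nn.Sequential(nn.Linear(D, D), nn.GELU()) for _ in range(N_LAYERS)]
)

def hidden_at(x, upto):
    """Run the stack and return the activation right after layer `upto`."""
    h = x
    for i, layer in enumerate(layers):
        h = layer(h)
        if i == upto:
            return h
    return h

# Toy "prompt activations": compliant vs. over-refused benign prompts.
compliant = torch.randn(32, D)
over_refused = torch.randn(32, D)

# Estimate a refusal-related direction at the target layer, and the mean
# projection of compliant activations onto it (the shift target).
with torch.no_grad():
    direction = (hidden_at(compliant, TARGET_LAYER).mean(0)
                 - hidden_at(over_refused, TARGET_LAYER).mean(0))
    direction = direction / direction.norm()
    target_proj = (hidden_at(compliant, TARGET_LAYER) @ direction).mean()

# Freeze every layer except the single target layer.
for i, layer in enumerate(layers):
    for p in layer.parameters():
        p.requires_grad = (i == TARGET_LAYER)

opt = torch.optim.Adam(layers[TARGET_LAYER].parameters(), lr=1e-3)

for step in range(200):
    h = hidden_at(over_refused, TARGET_LAYER)
    # Proxy objective: move over-refused activations along `direction` only
    # until their mean projection matches that of compliant prompts.
    loss = ((h @ direction) - target_proj).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The sketch captures two properties highlighted in the abstract, under the stated assumptions: only one layer's parameters receive gradients, and the objective targets a specific activation component rather than full output behavior.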

Cite

Text

Dabas et al. "Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Dabas et al. "Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/dabas2025icml-just/)

BibTeX

@inproceedings{dabas2025icml-just,
  title     = {{Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning}},
  author    = {Dabas, Mahavir and Chen, Si and Fleming, Charles and Jin, Ming and Jia, Ruoxi},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {11846--11861},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/dabas2025icml-just/}
}