Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning

Abstract

Safety alignment is crucial for Large Language Models (LLMs) to resist malicious instructions but often results in over-refusals, where benign prompts are unnecessarily rejected, impairing user experience and model utility. To this end, we introduce ACTOR (Activation-Based Training for Over-Refusal Reduction), a robust, compute- and data-efficient training framework that minimizes over-refusals by utilizing internal activation patterns from diverse queries. ACTOR precisely identifies and adjusts the activation components that trigger refusals, providing stronger control over the refusal mechanism. By fine-tuning only a single model layer, ACTOR effectively reduces over-refusals across multiple benchmarks while maintaining the model’s ability to handle harmful queries and preserving overall utility.
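The paper does not release its loss or training code in this entry, but the idea of fine-tuning a single layer so that refusal-related activations shift "just enough" can be illustrated with a minimal, hypothetical PyTorch sketch. Everything below is an assumption for illustration: the toy layer stack, the choice of target layer, the random stand-in "prompt activations", and the proxy loss are not the authors' ACTOR objective.

```python
# Hedged sketch (not the authors' code): train only one layer so that hidden
# activations of benign-but-over-refused prompts shift along a direction
# estimated from the difference between compliant and refused activations.
import torch
import torch.nn as nn

torch.manual_seed(0)

D, N_LAYERS, TARGET_LAYER = 64, 4, 2  # hypothetical sizes and layer index

# Stand-in for a transformer stack: simple MLP blocks over pooled embeddings.
layers = nn.ModuleList(
    [nn.Sequential(nn.Linear(D, D), nn.GELU()) for _ in range(N_LAYERS)]
)

def hidden_at(x, upto):
    """Run the stack and return the activation right after layer `upto`."""
    h = x
    for i, layer in enumerate(layers):
        h = layer(h)
        if i == upto:
            return h
    return h

# Toy "prompt activations": compliant vs. over-refused benign prompts.
compliant = torch.randn(32, D)
over_refused = torch.randn(32, D)

# Estimate a refusal-related direction at the target layer, and the mean
# projection of compliant activations onto it (the shift target).
with torch.no_grad():
    direction = (hidden_at(compliant, TARGET_LAYER).mean(0)
                 - hidden_at(over_refused, TARGET_LAYER).mean(0))
    direction = direction / direction.norm()
    target_proj = (hidden_at(compliant, TARGET_LAYER) @ direction).mean()

# Freeze every layer except the single target layer.
for i, layer in enumerate(layers):
    for p in layer.parameters():
        p.requires_grad = (i == TARGET_LAYER)

opt = torch.optim.Adam(layers[TARGET_LAYER].parameters(), lr=1e-3)

for step in range(200):
    h = hidden_at(over_refused, TARGET_LAYER)
    # Proxy objective: move over-refused activations along `direction` only
    # until their mean projection matches that of compliant prompts.
    loss = ((h @ direction) - target_proj).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The sketch captures two properties highlighted in the abstract, under the stated assumptions: only one layer's parameters receive gradients, and the objective targets a specific activation component rather than full output behavior.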

Cite

Text

Dabas et al. "Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Dabas et al. "Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/dabas2025icml-just/)

BibTeX

@inproceedings{dabas2025icml-just,
  title     = {{Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning}},
  author    = {Dabas, Mahavir and Chen, Si and Fleming, Charles and Jin, Ming and Jia, Ruoxi},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {11846--11861},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/dabas2025icml-just/}
}