Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning
Abstract
Safety alignment is crucial for Large Language Models (LLMs) to resist malicious instructions but often results in over-refusals, where benign prompts are unnecessarily rejected, impairing user experience and model utility. To this end, we introduce ACTOR (Activation-Based Training for Over-Refusal Reduction), a robust and compute- and data-efficient training framework that minimizes over-refusals by utilizing internal activation patterns from diverse queries. ACTOR precisely identifies and adjusts the activation components that trigger refusals, providing stronger control over the refusal mechanism. By fine-tuning only a single model layer, ACTOR effectively reduces over-refusals across multiple benchmarks while maintaining the model's ability to handle harmful queries and preserving overall utility.
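The abstract does not give implementation details, so the following is only a toy sketch of the general idea of activation-space refusal directions (a common framing in related work, not necessarily ACTOR's exact procedure): a "refusal direction" is estimated from activations at one layer, and an over-refused benign prompt's activation is shifted just enough along that direction. All data here is synthetic and the helper names are hypothetical.

```python
# Illustrative sketch only -- not the paper's implementation.
# Synthetic activations stand in for hidden states taken from a single LLM layer.
import torch

torch.manual_seed(0)
hidden_dim = 64

# Stand-in activations at one chosen layer (real ones would come from the model).
harmful_acts = torch.randn(100, hidden_dim) + 2.0   # prompts the model should refuse
benign_acts  = torch.randn(100, hidden_dim) - 2.0   # prompts the model should answer

# Difference-of-means direction separating refusal-like from compliance-like activations.
refusal_dir = harmful_acts.mean(0) - benign_acts.mean(0)
refusal_dir = refusal_dir / refusal_dir.norm()

def refusal_score(act: torch.Tensor) -> torch.Tensor:
    """Projection onto the refusal direction (higher = more refusal-like)."""
    return act @ refusal_dir

# An over-refused benign prompt: benign content whose activation drifted toward refusal.
over_refused = benign_acts[0] + 3.0 * refusal_dir
print(f"before shift: {refusal_score(over_refused):.2f}")

# "Just enough shift": remove only the component along the refusal direction,
# leaving the rest of the representation untouched.
shifted = over_refused - refusal_score(over_refused) * refusal_dir
print(f"after shift:  {refusal_score(shifted):.2f}")
```

In the paper this targeted adjustment is realized by fine-tuning a single model layer rather than by editing activations at inference time; the sketch above only conveys the geometric intuition.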
Cite
Text
Dabas et al. "Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning." Proceedings of the 42nd International Conference on Machine Learning, 2025.
Markdown
[Dabas et al. "Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/dabas2025icml-just/)
BibTeX
@inproceedings{dabas2025icml-just,
title = {{Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning}},
author = {Dabas, Mahavir and Chen, Si and Fleming, Charles and Jin, Ming and Jia, Ruoxi},
booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
year = {2025},
pages = {11846--11861},
volume = {267},
url = {https://mlanthology.org/icml/2025/dabas2025icml-just/}
}