Whose Instructions Count? Resolving Preference Bias in Instruction Fine-Tuning
Abstract
Instruction fine-tuning (IFT) has emerged as a ubiquitous strategy for specializing large language models (LLMs), yet it implicitly assumes a single, coherent "ground-truth" preference behind all human-written instructions. In practice, annotators differ in the styles, emphases, and granularities they prefer, introducing preference bias that can erode both robustness and generalization. We propose Dynamic Cross-Layer Preference Correction (DCPC), which couples (i) a preference-sensitive similarity estimator that detects mismatched instructional cues, (ii) cross-layer prefix alignment to reconcile semantic representations across transformer layers, and (iii) a lightweight Preference Correction Module (PCM) that dynamically adjusts hidden states to honor the inferred dominant preference. On five GLUE/SuperGLUE tasks and the Alpaca set—plus six preference-shifted variants—DCPC boosts accuracy/F1-EM by 4.0–6.7 points and GPT-score by +0.7, while cutting inter-seed variance by up to 35% on LLaMA-2 13B and Mistral-7B, setting a new state of the art for robust instruction tuning.
Cite
Text
Zhang et al. "Whose Instructions Count? Resolving Preference Bias in Instruction Fine-Tuning." Advances in Neural Information Processing Systems, 2025.
Markdown
[Zhang et al. "Whose Instructions Count? Resolving Preference Bias in Instruction Fine-Tuning." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/zhang2025neurips-whose/)
BibTeX
@inproceedings{zhang2025neurips-whose,
  title = {{Whose Instructions Count? Resolving Preference Bias in Instruction Fine-Tuning}},
  author = {Zhang, Jiayu and Li, Changbang and Peng, Yinan and Luo, Weihao and Yu, Peilai and Zhang, Xuan},
  booktitle = {Advances in Neural Information Processing Systems},
  year = {2025},
  url = {https://mlanthology.org/neurips/2025/zhang2025neurips-whose/}
}