Enhancing Language Model Calibration to Human Responses in Ethical Ambiguity via Fine-Tuning
Abstract
Language models often misinterpret human intentions because of how they handle ambiguity, a limitation well recognized in NLP research. While morally clear scenarios are relatively easy for LLMs to judge, morally ambiguous contexts remain difficult. In this work, we examined LLM calibration and found that human and LLM judgments are poorly aligned in such scenarios. We evaluated on two curated datasets from the Scruples project: DILEMMAS, which pairs distinct moral scenarios to assess a model's ability to compare and contrast ethical situations, and ANECDOTES, which presents individual narratives to evaluate a model's ability to extract details, interpret, and analyze moral scenarios. Model answer probabilities were extracted for all possible choices and compared with human annotations to benchmark the alignment of three models: Llama-3.1-8b, Zephyr-7b-beta, and Mistral-7b. Fine-tuning yielded significant improvements in both cross-entropy and Dirichlet scores, particularly the latter. Notably, after fine-tuning, Mistral-7B-Instruct-v0.3 performed on par with GPT-4o. However, all of the examined models were still outperformed by BERT and RoBERTa in terms of cross-entropy scores. Our fine-tuning approach substantially improved the models' ability to navigate ethical dilemmas and open-ended narratives by aligning them more closely with human moral reasoning. These findings establish a practical framework for refining training methods to address persistent calibration issues and improve ethical reasoning. By advancing AI's capability to handle morally ambiguous decision-making, this work highlights the potential to build systems that are fairer, more reliable, and better equipped to support sensitive societal decision-making.
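For illustration, the following is a minimal sketch of the kind of calibration measurement the abstract describes: extracting a model's probability over the answer choices for a DILEMMAS-style prompt and scoring it against the human annotation distribution with cross-entropy. The checkpoint name, prompt wording, choice labels, and example human distribution are assumptions for the sketch, not the authors' exact setup.

```python
# Minimal sketch (not the authors' released code): get a model's probability
# over two answer choices and compare it with the human label distribution.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.3"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# Illustrative DILEMMAS-style prompt; the real prompt format may differ.
prompt = (
    "Which action is less ethical?\n"
    "1. <scenario A>\n"
    "2. <scenario B>\n"
    "Answer with 1 or 2: "
)

# Logits for the token immediately following the prompt.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]

# Probability mass on each choice label, renormalized over the two choices.
# Taking the first token of each label is a simplification.
choice_ids = [tokenizer.encode(c, add_special_tokens=False)[0] for c in ["1", "2"]]
model_probs = torch.softmax(logits[choice_ids], dim=-1)

# Example human annotation distribution (e.g., 3 of 5 annotators chose option 1).
human_probs = torch.tensor([0.6, 0.4])

# Cross-entropy between the human distribution and the model's distribution.
cross_entropy = -(human_probs * torch.log(model_probs)).sum()
print(f"model probs: {model_probs.tolist()}, cross-entropy: {cross_entropy.item():.4f}")
```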
Cite
Text
Senthilkumar et al. "Enhancing Language Model Calibration to Human Responses in Ethical Ambiguity via Fine-Tuning." NeurIPS 2024 Workshops: SoLaR, 2024.
BibTeX
@inproceedings{senthilkumar2024neuripsw-enhancing,
title = {{Enhancing Language Model Calibration to Human Responses in Ethical Ambiguity via Fine-Tuning}},
author = {Senthilkumar, Pranav and Balasubramanian, Visshwa and Jain, Prisha and Maity, Aneesa and Lu, Jonathan and Zhu, Kevin},
booktitle = {NeurIPS 2024 Workshops: SoLaR},
year = {2024},
url = {https://mlanthology.org/neuripsw/2024/senthilkumar2024neuripsw-enhancing/}
}