Enhancing Large Language Models with Ensemble of Critics for Mitigating Toxicity and Hallucination

Abstract

We propose a self-correction mechanism for Large Language Models (LLMs) to mitigate issues such as toxicity and factual hallucination. The method refines model outputs through an ensemble of critics combined with the model's own feedback. Drawing inspiration from human behavior, we explore whether LLMs can emulate the self-correction process observed in humans, who often engage in self-reflection and seek input from others to refine their understanding of complex topics. Our approach is model-agnostic and can be applied across various domains to enhance trustworthiness by addressing fairness, bias, and robustness concerns. Across experiments, we consistently observe improvements in reducing toxicity and correcting factual errors in LLM outputs.
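
As a rough illustration of the critique-and-refine loop sketched in the abstract (not the authors' implementation), the following Python snippet iterates a generator over aggregated feedback from an ensemble of critics until no critic objects or a round limit is reached. The generate callable, the critic interface, the revision prompt template, and the stopping rule are all assumed placeholders.

# Minimal sketch of a critique-and-refine loop with an ensemble of critics.
# The generator, critics, prompt template, and stopping rule are illustrative
# placeholders, not the paper's implementation.

from typing import Callable, List

Critic = Callable[[str, str], str]  # (prompt, draft) -> feedback, "" if no issue found

def refine_with_critics(
    prompt: str,
    generate: Callable[[str], str],   # hypothetical LLM call: prompt -> text
    critics: List[Critic],            # ensemble of critics (e.g., toxicity, factuality)
    max_rounds: int = 3,
) -> str:
    draft = generate(prompt)
    for _ in range(max_rounds):
        # Collect non-empty feedback from every critic in the ensemble.
        feedback = [f for f in (c(prompt, draft) for c in critics) if f]
        if not feedback:  # all critics satisfied -> stop refining
            break
        # Ask the model to revise its own output given the aggregated feedback.
        revision_prompt = (
            f"{prompt}\n\nDraft answer:\n{draft}\n\n"
            "Reviewer feedback:\n- " + "\n- ".join(feedback) +
            "\n\nRewrite the answer addressing all feedback."
        )
        draft = generate(revision_prompt)
    return draft

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without an actual LLM backend.
    def toy_generate(p: str) -> str:
        return "The answer is 42." if "Rewrite" in p else "The darn answer is 42."

    def toy_toxicity_critic(prompt: str, draft: str) -> str:
        return "Remove the mild profanity." if "darn" in draft else ""

    print(refine_with_critics("What is 6 * 7?", toy_generate, [toy_toxicity_critic]))

In practice the critics could be specialized classifiers or prompted LLM judges, and their feedback could be weighted or filtered before being fed back to the generator; those design choices are beyond this sketch.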

Cite

Text

Mousavi et al. "Enhancing Large Language Models with Ensemble of Critics for Mitigating Toxicity and Hallucination." NeurIPS 2023 Workshops: R0-FoMo, 2023.

Markdown

[Mousavi et al. "Enhancing Large Language Models with Ensemble of Critics for Mitigating Toxicity and Hallucination." NeurIPS 2023 Workshops: R0-FoMo, 2023.](https://mlanthology.org/neuripsw/2023/mousavi2023neuripsw-enhancing/)

BibTeX

@inproceedings{mousavi2023neuripsw-enhancing,
  title     = {{Enhancing Large Language Models with Ensemble of Critics for Mitigating Toxicity and Hallucination}},
  author    = {Mousavi, Sajad and Gutierrez, Ricardo Luna and Rengarajan, Desik and Gundecha, Vineet and Babu, Ashwin Ramesh and Naug, Avisek and Guillen, Antonio and Sarkar, Soumyendu},
  booktitle = {NeurIPS 2023 Workshops: R0-FoMo},
  year      = {2023},
  url       = {https://mlanthology.org/neuripsw/2023/mousavi2023neuripsw-enhancing/}
}