ReFeR: A Hierarchical Framework of Models as Evaluative and Reasoning Agents
Abstract
Assessing the quality of Natural Language Generation (NLG) outputs, such as those produced by large language models (LLMs), poses significant challenges. Human evaluations are not scalable, and traditional automatic metrics exhibit low correlation with human judgment. In this study, we propose Review-Feedback-Reason (ReFeR), a novel evaluation framework for NLG using LLM agents. The proposed framework enhances the accuracy of NLG evaluation, surpassing previous benchmarks by $\sim$20\%. Moreover, the feedback collected from our framework is leveraged to instruction fine-tune smaller models such as Mistral-7B, yielding better correlation with human evaluations and performance nearly on par with GPT-3.5. We highlight another ancillary benefit of our methodology through its application to reasoning benchmarks, where it outperforms most state-of-the-art methods and also beats GPT-3.5 Turbo by $\sim$11.67\% and GPT-4 by $\sim$1\% on average.
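The abstract describes ReFeR as a hierarchical arrangement of LLM agents that first review an NLG output and then reason over the collected feedback to reach a verdict. The sketch below illustrates one way such a review-then-reason loop could be wired up; the peer/chief split, the prompts, and the 1-5 scoring scale are assumptions made for illustration and are not the paper's exact implementation.

from typing import Callable, List

# A minimal sketch of a hierarchical "review then reason" evaluation loop in the
# spirit of ReFeR. Roles, prompts, and the aggregation step are assumptions for
# illustration, not the paper's implementation.

def refer_evaluate(
    source: str,
    candidate: str,
    peer_llms: List[Callable[[str], str]],  # smaller "peer reviewer" models (assumed role)
    chief_llm: Callable[[str], str],        # larger model that aggregates reviews (assumed role)
) -> str:
    """Collect peer reviews of a candidate output, then have a chief evaluator
    reason over them to produce a final score and consolidated feedback."""
    # Step 1 (Review): each peer model independently critiques the candidate.
    reviews = [
        peer(
            f"Source:\n{source}\n\nCandidate output:\n{candidate}\n\n"
            "Review the candidate's quality and give a score from 1 to 5 "
            "with a short justification."
        )
        for peer in peer_llms
    ]

    # Step 2 (Feedback + Reason): the chief evaluator reads all peer reviews
    # and reasons to a final verdict.
    joined = "\n\n".join(f"Peer review {i + 1}:\n{r}" for i, r in enumerate(reviews))
    return chief_llm(
        f"Source:\n{source}\n\nCandidate output:\n{candidate}\n\n{joined}\n\n"
        "Considering the peer reviews above, reason step by step and give a "
        "final score from 1 to 5 with consolidated feedback."
    )

if __name__ == "__main__":
    # Stub LLMs so the sketch runs without any API; swap in real model calls as needed.
    peer = lambda prompt: "Score: 4. Fluent and mostly faithful to the source."
    chief = lambda prompt: "Final score: 4. Peers agree the output is fluent; minor coverage gaps."
    print(refer_evaluate("A news article about renewable energy.",
                         "A one-sentence summary.", [peer, peer], chief))

The collected chief-level feedback could then serve as instruction-tuning data for a smaller model, as the abstract describes for Mistral-7B; that fine-tuning step is not shown here.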
Cite
Text
Narsupalli et al. "ReFeR: A Hierarchical Framework of Models as Evaluative and Reasoning Agents." NeurIPS 2024 Workshops: SoLaR, 2024.
Markdown
[Narsupalli et al. "ReFeR: A Hierarchical Framework of Models as Evaluative and Reasoning Agents." NeurIPS 2024 Workshops: SoLaR, 2024.](https://mlanthology.org/neuripsw/2024/narsupalli2024neuripsw-refer/)
BibTeX
@inproceedings{narsupalli2024neuripsw-refer,
title = {{ReFeR: A Hierarchical Framework of Models as Evaluative and Reasoning Agents}},
author = {Narsupalli, Yaswanth and Chandra, Abhranil and Muppirala, Sreevatsa and Gupta, Manish and Goyal, Pawan},
booktitle = {NeurIPS 2024 Workshops: SoLaR},
year = {2024},
url = {https://mlanthology.org/neuripsw/2024/narsupalli2024neuripsw-refer/}
}