ReFeR: Improving Evaluation and Reasoning Through Hierarchy of Models

Abstract

Assessing the quality of outputs from generative models, such as large language models (LLMs) or vision-language models (VLMs), poses significant challenges. Traditional evaluation methods rely either on human assessment, which is resource-intensive and does not scale, or on automatic metrics that often correlate poorly with human preferences. Another approach is to train dedicated neural evaluators, but this typically requires substantial training data and compute. In this study, we introduce ReFeR, a tuning-free framework for evaluating generative outputs, including both text and images, using a two-level hierarchy of pre-trained LLM and VLM evaluators. This multi-agent hierarchical strategy leverages additional compute at inference time by orchestrating multiple models and using the increased test-time reasoning to boost performance. By having the models themselves provide feedback and final judgments, ReFeR reduces the dependence on human evaluation. We rigorously evaluate ReFeR on four diverse evaluation benchmarks, where it surpasses prior methods in accuracy while also generating constructive feedback useful for downstream distillation and self-improvement via finetuning. Interestingly, ReFeR is also applicable to reasoning tasks: experiments on four reasoning benchmarks show ReFeR’s superior collective reasoning abilities. We present two variants of the framework: ReFeR-Turbo, optimized for accelerated performance, and ReFeR-Lite, which is more efficient in test-time compute. ReFeR-Lite is $\sim$12-14$\times$ more compute-efficient than previous works while being comparably accurate to ReFeR-Turbo.

Cite

Text

Narsupalli et al. "ReFeR: Improving Evaluation and Reasoning Through Hierarchy of Models." Transactions on Machine Learning Research, 2025.

Markdown

[Narsupalli et al. "ReFeR: Improving Evaluation and Reasoning Through Hierarchy of Models." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/narsupalli2025tmlr-refer/)

BibTeX

@article{narsupalli2025tmlr-refer,
  title     = {{ReFeR: Improving Evaluation and Reasoning Through Hierarchy of Models}},
  author    = {Narsupalli, Yaswanth and Chandra, Abhranil and Muppirala, Sreevatsa and Gupta, Manish and Goyal, Pawan},
  journal   = {Transactions on Machine Learning Research},
  year      = {2025},
  url       = {https://mlanthology.org/tmlr/2025/narsupalli2025tmlr-refer/}
}