Retrieval as a Decision: Training-Free Adaptive Gating for Efficient RAG

Abstract

Retrieval-Augmented Generation (RAG) improves factuality but retrieving for every query often hurts quality while inflating tokens and latency. We propose Training-free Adaptive Retrieval Gating (\textbf{TARG}), a single-shot policy that decides when to retrieve using only a short, no-context draft from the base model. From the draft’s prefix logits, TARG computes lightweight uncertainty scores—mean token entropy, a margin signal derived from the top-1/top-2 logit gap via a monotone link, or small-$N$ variance across a handful of stochastic prefixes—and triggers retrieval only when the score exceeds a threshold. The gate is model-agnostic, adds only tens to hundreds of draft tokens, and requires no additional training or auxiliary heads. On NQ-Open, TriviaQA, and PopQA, TARG consistently pushes the accuracy–efficiency frontier: compared with Always-RAG\footnote{\textsc{Always-RAG}: retrieve for every query; \textsc{Never-RAG}: never retrieve.}, TARG matches or improves EM/F1 while reducing retrieval by 70–90\% and cutting end-to-end latency, and it remains close to Never-RAG in overhead. A central empirical finding is that under modern instruction-tuned LLMs the margin signal is a robust default (entropy compresses as backbones sharpen), with small-$N$ variance offering a conservative, budget-first alternative. We provide ablations over gate type and prefix length and use a $\Delta$-latency view to make budget trade-offs explicit.

Cite

Text

Wang et al. "Retrieval as a Decision: Training-Free Adaptive Gating for Efficient RAG." Transactions on Machine Learning Research, 2026.

Markdown

[Wang et al. "Retrieval as a Decision: Training-Free Adaptive Gating for Efficient RAG." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/wang2026tmlr-retrieval/)

BibTeX

@article{wang2026tmlr-retrieval,
  title     = {{Retrieval as a Decision: Training-Free Adaptive Gating for Efficient RAG}},
  author    = {Wang, Yufeng and Wei, Lu and Ling, Haibin},
  journal   = {Transactions on Machine Learning Research},
  year      = {2026},
  url       = {https://mlanthology.org/tmlr/2026/wang2026tmlr-retrieval/}
}