Universal Model Routing for Efficient LLM Inference

Abstract

Model routing is a simple technique for reducing the inference cost of large language models (LLMs), wherein one maintains a pool of candidate LLMs, and learns to route each prompt to the smallest feasible LLM. Existing works focus on learning a router for a fixed pool of LLMs. In this paper, we consider the problem of dynamic routing, where new, previously unobserved LLMs are available at test time. We propose UniRoute, a new approach to this problem that relies on representing each LLM as a feature vector, derived based on predictions on a set of representative prompts. Based on this, we detail two effective instantiations of UniRoute, relying on cluster-based routing and a learned cluster map respectively. We show that these are estimates of a theoretically optimal routing rule, and quantify their errors via an excess risk bound. Experiments on a range of public benchmarks show the effectiveness of UniRoute in routing amongst more than 30 unseen LLMs.
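
As a rough illustration of the cluster-based idea described above, the following Python sketch represents each LLM by its mean error on each cluster of a small set of representative prompts, then routes a new prompt to the LLM with the lowest cost-adjusted error in the prompt's nearest cluster. This is a minimal sketch under stated assumptions, not the paper's implementation: the abstract does not specify these details, so the precomputed centroids, the 0/1 error labels, the per-LLM costs, the trade-off weight `lam`, and all function names are hypothetical. Because an LLM's feature vector depends only on its own predictions on the representative prompts, a previously unseen LLM can be added without refitting anything else.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def nearest_cluster(x: Vector, centroids: List[Vector]) -> int:
    """Index of the centroid closest to x in squared Euclidean distance."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(range(len(centroids)), key=lambda k: sq_dist(x, centroids[k]))


def llm_feature_vector(val_embeddings: List[Vector],
                       val_errors: Sequence[float],
                       centroids: List[Vector]) -> List[float]:
    """Feature vector for one LLM: its mean error within each prompt cluster."""
    totals = [0.0] * len(centroids)
    counts = [0] * len(centroids)
    for emb, err in zip(val_embeddings, val_errors):
        k = nearest_cluster(emb, centroids)
        totals[k] += err
        counts[k] += 1
    # Assumption: an empty cluster falls back to the LLM's overall mean error.
    overall = sum(val_errors) / len(val_errors)
    return [t / c if c else overall for t, c in zip(totals, counts)]


def route(prompt_embedding: Vector,
          centroids: List[Vector],
          features: Dict[str, List[float]],
          costs: Dict[str, float],
          lam: float) -> str:
    """Pick the LLM minimizing (estimated error in the prompt's cluster) + lam * cost."""
    k = nearest_cluster(prompt_embedding, centroids)
    return min(features, key=lambda name: features[name][k] + lam * costs[name])


# Toy data (entirely made up): two prompt clusters, two candidate LLMs.
centroids = [[0.0, 0.0], [10.0, 10.0]]
val_embeddings = [[0.1, 0.2], [0.3, -0.1], [9.8, 10.1], [10.2, 9.9]]
val_errors = {
    "small": [0, 0, 1, 1],  # fails on prompts in the second cluster
    "large": [0, 0, 0, 0],  # correct everywhere, but more expensive
}
costs = {"small": 1.0, "large": 10.0}

features = {
    name: llm_feature_vector(val_embeddings, errs, centroids)
    for name, errs in val_errors.items()
}

easy_choice = route([0.0, 0.1], centroids, features, costs, lam=0.05)
hard_choice = route([10.0, 10.0], centroids, features, costs, lam=0.05)
print(easy_choice, hard_choice)
```

In this toy setup, prompts near the first cluster go to the cheaper model because both models have the same estimated error there, while prompts near the second cluster go to the larger model because its lower estimated error outweighs its cost penalty. Raising `lam` shifts more traffic toward cheaper models.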

Cite

Text

Jitkrittum et al. "Universal Model Routing for Efficient LLM Inference." International Conference on Learning Representations, 2026.

Markdown

[Jitkrittum et al. "Universal Model Routing for Efficient LLM Inference." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/jitkrittum2026iclr-universal/)

BibTeX

@inproceedings{jitkrittum2026iclr-universal,
  title     = {{Universal Model Routing for Efficient LLM Inference}},
  author    = {Jitkrittum, Wittawat and Narasimhan, Harikrishna and Rawat, Ankit Singh and Juneja, Jeevesh and Wang, Congchao and Wang, Zifeng and Go, Alec and Lee, Chen-Yu and Shenoy, Pradeep and Panigrahy, Rina and Menon, Aditya Krishna and Kumar, Sanjiv},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/jitkrittum2026iclr-universal/}
}