Near-Optimal Online Deployment and Routing for Streaming LLMs

Abstract

The rapid pace at which new large language models (LLMs) appear, and older ones become obsolete, forces providers to manage a streaming inventory under a strict concurrency cap and per-query cost budgets. We cast this as an online decision problem that couples *stage-wise deployment* (at fixed maintenance windows) with *per-query routing* among live models. We introduce *StageRoute*, a hierarchical algorithm that (i) optimistically selects up to $M_{\max}$ models for the next stage using reward upper-confidence and cost lower-confidence bounds, and (ii) routes each incoming query by solving a budget- and throughput-constrained bandit subproblem over the deployed set. We prove a regret of $\tilde{\mathcal{O}}(T^{2/3})$ with a matching lower bound, establishing near-optimality, and validate the theory empirically: *StageRoute* tracks a strong oracle under tight budgets across diverse workloads.
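To make the two-level structure concrete, the following is a minimal sketch of the idea, not the paper's StageRoute. It assumes rewards and costs lie in $[0, 1]$, uses a Hoeffding-style confidence radius, and replaces the constrained optimization described in the abstract with two greedy stand-ins: ranking by optimistic reward per unit cost for deployment, and a budget filter followed by highest reward upper bound for routing. The names `ModelStats`, `select_deployment`, and `route_query` are hypothetical, and the throughput constraint is omitted.

```python
import math
from dataclasses import dataclass


@dataclass
class ModelStats:
    """Running statistics for one candidate model (hypothetical container)."""
    name: str
    pulls: int = 0
    reward_sum: float = 0.0
    cost_sum: float = 0.0

    def update(self, reward: float, cost: float) -> None:
        """Record one observed (reward, cost) pair for this model."""
        self.pulls += 1
        self.reward_sum += reward
        self.cost_sum += cost


def confidence_radius(pulls: int, t: int) -> float:
    """Hoeffding-style radius (an assumed choice); requires pulls > 0."""
    return math.sqrt(2.0 * math.log(max(t, 2)) / pulls)


def reward_ucb(m: ModelStats, t: int) -> float:
    """Optimistic reward estimate, clipped to the assumed range [0, 1]."""
    if m.pulls == 0:
        return 1.0  # unseen models get the most optimistic reward
    return min(1.0, m.reward_sum / m.pulls + confidence_radius(m.pulls, t))


def cost_lcb(m: ModelStats, t: int, floor: float = 1e-3) -> float:
    """Optimistic (low) cost estimate, kept above a small positive floor."""
    if m.pulls == 0:
        return floor
    return max(floor, m.cost_sum / m.pulls - confidence_radius(m.pulls, t))


def select_deployment(candidates, t, m_max):
    """Greedy stand-in for stage-wise deployment: keep at most m_max models
    with the highest optimistic reward-per-cost ratio."""
    ranked = sorted(
        candidates,
        key=lambda m: reward_ucb(m, t) / cost_lcb(m, t),
        reverse=True,
    )
    return ranked[:m_max]


def route_query(deployed, t, budget):
    """Greedy stand-in for per-query routing: among deployed models whose
    optimistic cost fits the budget, pick the highest reward UCB; if none
    fits, fall back to the model with the lowest optimistic cost."""
    affordable = [m for m in deployed if cost_lcb(m, t) <= budget]
    if not affordable:
        return min(deployed, key=lambda m: cost_lcb(m, t))
    return max(affordable, key=lambda m: reward_ucb(m, t))
```

In use, `select_deployment` would be called once per maintenance window and `route_query` once per incoming query, with `ModelStats.update` fed the observed reward and cost after each routed query. Because unseen models receive the most optimistic estimates, newly arrived models are tried early, which is the exploration behavior the optimism principle is meant to produce.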

Cite

Text

Li and Li. "Near-Optimal Online Deployment and Routing for Streaming LLMs." International Conference on Learning Representations, 2026.

Markdown

[Li and Li. "Near-Optimal Online Deployment and Routing for Streaming LLMs." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/li2026iclr-nearoptimal-a/)

BibTeX

@inproceedings{li2026iclr-nearoptimal-a,
  title     = {{Near-Optimal Online Deployment and Routing for Streaming LLMs}},
  author    = {Li, Shaoang and Li, Jian},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/li2026iclr-nearoptimal-a/}
}