Learned Best-Effort LLM Serving

Abstract

Many applications must provide low-latency LLM service to users or risk an unacceptable user experience. However, over-provisioning resources to serve fluctuating request patterns is often prohibitively expensive. In this work, we present a best-effort serving system that employs deep reinforcement learning to adjust service quality based on the task distribution and system load. Compared to static serving on unpredictable workloads, our best-effort system maintains availability at over 10× higher client request rates, serves above 96% of peak performance 4.1× more often, and serves above 98% of peak performance 2.3× more often.
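To make the idea concrete, the following is a minimal sketch (not the authors' implementation) of a learned best-effort controller in the spirit of the abstract: a tabular Q-learning policy that observes the current request rate and selects a service-quality level, rewarding high quality only when latency targets are met. The quality levels, load discretization, reward shape, and all names are illustrative assumptions.

# Minimal sketch of a learned best-effort controller (illustrative assumptions only).
import random

QUALITY_LEVELS = [1.0, 0.75, 0.5, 0.25]   # assumed fractions of full service quality
LOAD_BUCKETS = 10                          # assumed discretization of request rate
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1      # standard Q-learning hyperparameters

# Tabular Q-values: Q[load_bucket][quality_index]
Q = [[0.0] * len(QUALITY_LEVELS) for _ in range(LOAD_BUCKETS)]

def bucketize(request_rate, max_rate=100.0):
    """Map an observed request rate (requests/sec) to a discrete load bucket."""
    return min(LOAD_BUCKETS - 1, int(request_rate / max_rate * LOAD_BUCKETS))

def choose_quality(load_bucket):
    """Epsilon-greedy choice of a service-quality level for the current load."""
    if random.random() < EPSILON:
        return random.randrange(len(QUALITY_LEVELS))
    row = Q[load_bucket]
    return row.index(max(row))

def reward(quality, met_latency_slo):
    """Assumed reward: prefer high quality, but only when latency targets are met."""
    return quality if met_latency_slo else -1.0

def update(state, action, r, next_state):
    """One tabular Q-learning update from an observed transition."""
    best_next = max(Q[next_state])
    Q[state][action] += ALPHA * (r + GAMMA * best_next - Q[state][action])

In a serving loop, such a controller would be queried per scheduling interval with the measured request rate, and updated with the observed latency outcome; the paper's system uses deep reinforcement learning rather than this tabular simplification.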

Cite

Text

Jha et al. "Learned Best-Effort LLM Serving." ICML 2024 Workshops: ES-FoMo-II, 2024.

Markdown

[Jha et al. "Learned Best-Effort LLM Serving." ICML 2024 Workshops: ES-FoMo-II, 2024.](https://mlanthology.org/icmlw/2024/jha2024icmlw-learned/)

BibTeX

@inproceedings{jha2024icmlw-learned,
  title     = {{Learned Best-Effort LLM Serving}},
  author    = {Jha, Siddharth and Hooper, Coleman Richard Charles and Liu, Xiaoxuan and Kim, Sehoon and Keutzer, Kurt},
  booktitle = {ICML 2024 Workshops: ES-FoMo-II},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/jha2024icmlw-learned/}
}