DéjàVu: KV-Cache Streaming for Fast, Fault-Tolerant Generative LLM Serving

Abstract

Distributed LLM serving is costly and often underutilizes hardware accelerators due to three key challenges: bubbles in pipeline-parallel deployments caused by the bimodal latency of prompt and token processing, GPU memory overprovisioning, and long recovery times in case of failures. DéjàVu addresses all these challenges using a versatile and efficient KV cache streaming library (DéjàVuLib). Using DéjàVuLib, we propose and implement efficient prompt-token disaggregation to reduce pipeline bubbles, microbatch swapping for efficient GPU memory management, and state replication for fault-tolerance. We highlight the efficacy of these solutions on a range of large models across cloud deployments.
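The abstract mentions microbatch swapping of KV caches between GPU and host memory; as a rough, hedged illustration of that general idea (not DéjàVuLib's actual API — the class and method names KVCacheSwapper, swap_out, swap_in, and wait below are hypothetical), a minimal PyTorch sketch might look like this:

# Illustrative sketch only, assuming PyTorch with CUDA available.
# While one microbatch computes on the GPU, the KV cache of an idle microbatch
# is streamed to pinned host memory and prefetched back before its next turn,
# so GPU memory only needs to hold the caches of microbatches in flight.
import torch

class KVCacheSwapper:
    def __init__(self, num_layers: int, cache_shape: tuple, device: str = "cuda"):
        self.stream = torch.cuda.Stream()  # side stream so copies overlap with compute
        # One (K, V) pair per layer, resident on the GPU for the active microbatch.
        self.gpu_cache = [(torch.empty(cache_shape, device=device),
                           torch.empty(cache_shape, device=device))
                          for _ in range(num_layers)]
        # Pinned host buffers allow asynchronous copies between host and device.
        self.cpu_cache = [(torch.empty(cache_shape, pin_memory=True),
                           torch.empty(cache_shape, pin_memory=True))
                          for _ in range(num_layers)]

    def swap_out(self):
        """Stream the GPU-resident KV cache to host memory on the side stream."""
        with torch.cuda.stream(self.stream):
            for (k_gpu, v_gpu), (k_cpu, v_cpu) in zip(self.gpu_cache, self.cpu_cache):
                k_cpu.copy_(k_gpu, non_blocking=True)
                v_cpu.copy_(v_gpu, non_blocking=True)

    def swap_in(self):
        """Prefetch the KV cache back to the GPU before the microbatch's next step."""
        with torch.cuda.stream(self.stream):
            for (k_gpu, v_gpu), (k_cpu, v_cpu) in zip(self.gpu_cache, self.cpu_cache):
                k_gpu.copy_(k_cpu, non_blocking=True)
                v_gpu.copy_(v_cpu, non_blocking=True)

    def wait(self):
        """Make the compute stream wait for any in-flight cache copies."""
        torch.cuda.current_stream().wait_stream(self.stream)

The same copy machinery, pointed at a remote replica instead of host memory, conveys the intuition behind the paper's state replication for fault tolerance; the paper itself should be consulted for the actual design.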

Cite

Text

Strati et al. "DéjàVu: KV-Cache Streaming for Fast, Fault-Tolerant Generative LLM Serving." International Conference on Machine Learning, 2024.

Markdown

[Strati et al. "DéjàVu: KV-Cache Streaming for Fast, Fault-Tolerant Generative LLM Serving." International Conference on Machine Learning, 2024.](https://mlanthology.org/icml/2024/strati2024icml-dejavu/)

BibTeX

@inproceedings{strati2024icml-dejavu,
  title     = {{DéjàVu: KV-Cache Streaming for Fast, Fault-Tolerant Generative LLM Serving}},
  author    = {Strati, Foteini and McAllister, Sara and Phanishayee, Amar and Tarnawski, Jakub and Klimovic, Ana},
  booktitle = {International Conference on Machine Learning},
  year      = {2024},
  pages     = {46745--46771},
  volume    = {235},
  url       = {https://mlanthology.org/icml/2024/strati2024icml-dejavu/}
}