DéjàVu: KV-Cache Streaming for Fast, Fault-Tolerant Generative LLM Serving
Abstract
Distributed LLM serving is costly and often underutilizes hardware accelerators due to three key challenges: bubbles in pipeline-parallel deployments caused by the bimodal latency of prompt and token processing, GPU memory overprovisioning, and long recovery times in case of failures. DéjàVu addresses all these challenges using a versatile and efficient KV cache streaming library (DéjàVuLib). Using DéjàVuLib, we propose and implement efficient prompt-token disaggregation to reduce pipeline bubbles, microbatch swapping for efficient GPU memory management, and state replication for fault-tolerance. We highlight the efficacy of these solutions on a range of large models across cloud deployments.
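To make the microbatch swapping idea concrete, the sketch below shows one (assumed, not from the paper's codebase) way to keep only the active microbatch's KV cache on the GPU while parking the others in pinned host memory and streaming them back and forth on a side CUDA stream. All names, shapes, and the PyTorch-based design are illustrative assumptions; DéjàVuLib itself is a dedicated streaming library, not this code.

```python
# Conceptual sketch (not the DéjàVu implementation): microbatch KV-cache
# swapping between GPU and pinned host memory, so only the microbatch
# currently being processed resides on the accelerator.
import torch

# Illustrative model dimensions (assumptions).
NUM_LAYERS, NUM_HEADS, HEAD_DIM, MAX_SEQ = 4, 8, 64, 256


def make_kv_cache(batch, device, pin=False):
    """Allocate a per-layer (K, V) cache for one microbatch."""
    shape = (NUM_LAYERS, 2, batch, NUM_HEADS, MAX_SEQ, HEAD_DIM)
    if pin:
        return torch.empty(shape, dtype=torch.float16, pin_memory=True)
    return torch.empty(shape, dtype=torch.float16, device=device)


class MicrobatchSwapper:
    """Keep at most one microbatch's KV cache resident on the GPU; hold the
    rest in pinned host buffers and copy asynchronously on a side stream."""

    def __init__(self, num_microbatches, batch):
        self.host = [make_kv_cache(batch, "cpu", pin=True)
                     for _ in range(num_microbatches)]
        self.gpu = make_kv_cache(batch, "cuda")  # single GPU-resident slot
        self.stream = torch.cuda.Stream()
        self.resident = None

    def fetch(self, mb):
        """Bring microbatch `mb`'s cache onto the GPU before its next step."""
        if self.resident == mb:
            return self.gpu
        with torch.cuda.stream(self.stream):
            self.gpu.copy_(self.host[mb], non_blocking=True)
        # Make the compute stream wait until the copy has finished.
        torch.cuda.current_stream().wait_stream(self.stream)
        self.resident = mb
        return self.gpu

    def evict(self, mb):
        """Stream the updated cache back to host memory after the step."""
        with torch.cuda.stream(self.stream):
            self.host[mb].copy_(self.gpu, non_blocking=True)
```

The same host-side buffers could, in principle, also serve as the replication target for fault tolerance, since a copy of each microbatch's KV state already lives off-GPU; whether and how that is done in DéjàVu is described in the paper itself.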
Cite
Text
Strati et al. "DéjàVu: KV-Cache Streaming for Fast, Fault-Tolerant Generative LLM Serving." International Conference on Machine Learning, 2024.

Markdown

[Strati et al. "DéjàVu: KV-Cache Streaming for Fast, Fault-Tolerant Generative LLM Serving." International Conference on Machine Learning, 2024.](https://mlanthology.org/icml/2024/strati2024icml-dejavu/)

BibTeX
@inproceedings{strati2024icml-dejavu,
  title     = {{DéjàVu: KV-Cache Streaming for Fast, Fault-Tolerant Generative LLM Serving}},
  author    = {Strati, Foteini and McAllister, Sara and Phanishayee, Amar and Tarnawski, Jakub and Klimovic, Ana},
  booktitle = {International Conference on Machine Learning},
  year      = {2024},
  pages     = {46745--46771},
  volume    = {235},
  url       = {https://mlanthology.org/icml/2024/strati2024icml-dejavu/}
}