QUEST: Query-Aware Sparsity for Efficient Long-Context LLM Inference

Abstract

As the demand for long-context large language models (LLMs) increases, models with context windows of up to 128K or 1M tokens are becoming increasingly prevalent. However, long-context LLM inference is challenging since the inference speed decreases significantly as the sequence length grows. This slowdown is primarily caused by loading a large KV cache during self-attention. Previous works have shown that a small portion of critical tokens will dominate the attention outcomes. However, we observe that the criticality of a token depends strongly on the query. To this end, we propose Quest, a query-aware KV cache selection algorithm. Quest keeps track of the minimal and maximal Key values in KV cache pages and estimates the criticality of a given page using Query vectors. By only loading the Top-K critical KV cache pages for attention, Quest significantly speeds up self-attention without sacrificing accuracy. We show that Quest can achieve up to 7.03x self-attention speedup, which reduces inference latency by 2.23x, while maintaining negligible accuracy loss on tasks with long dependencies. Code is available at https://github.com/mit-han-lab/quest.
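To make the page-selection idea concrete, below is a minimal PyTorch sketch of the query-aware criticality estimate the abstract describes: each page stores the elementwise min and max of its Key vectors, and a page's score upper-bounds the dot product between the query and any key in that page. The function names (`build_page_metadata`, `estimate_criticality`, `quest_attention`) and the page/Top-K sizes are illustrative assumptions, not the paper's API; the actual implementation uses fused CUDA kernels and runs per attention head at each decoding step.

```python
import torch

def build_page_metadata(keys: torch.Tensor, page_size: int):
    """Per-page elementwise min/max over Key vectors (the page metadata)."""
    num_pages = keys.shape[0] // page_size
    pages = keys[: num_pages * page_size].view(num_pages, page_size, -1)
    return pages.min(dim=1).values, pages.max(dim=1).values  # each [num_pages, d]

def estimate_criticality(query, key_min, key_max):
    """Upper bound on q . k over every key in a page: per channel, take the
    larger of q_i * min_i and q_i * max_i, then sum across channels."""
    return torch.maximum(query * key_min, query * key_max).sum(dim=-1)

def quest_attention(query, keys, values, page_size: int = 16, top_k: int = 8):
    """Attend only over the Top-K pages ranked by the criticality bound."""
    key_min, key_max = build_page_metadata(keys, page_size)
    crit = estimate_criticality(query, key_min, key_max)
    top_pages = torch.topk(crit, k=min(top_k, crit.numel())).indices
    # Gather only the selected pages' KV entries; the rest of the cache is
    # never loaded, which is where the memory-bandwidth savings come from.
    idx = (top_pages[:, None] * page_size + torch.arange(page_size)).flatten()
    k_sel, v_sel = keys[idx], values[idx]
    attn = torch.softmax(query @ k_sel.T / keys.shape[-1] ** 0.5, dim=-1)
    return attn @ v_sel

# Toy usage: a single head with d=64 and 512 cached tokens.
torch.manual_seed(0)
K, V, q = torch.randn(512, 64), torch.randn(512, 64), torch.randn(64)
print(quest_attention(q, K, V).shape)  # torch.Size([64])
```

Because the per-channel min/max bound the dot product for every key in a page, a page's score never underestimates its most critical token, so at page granularity the estimate does not drop truly critical tokens.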

Cite

Text

Tang et al. "QUEST: Query-Aware Sparsity for Efficient Long-Context LLM Inference." International Conference on Machine Learning, 2024.

Markdown

[Tang et al. "QUEST: Query-Aware Sparsity for Efficient Long-Context LLM Inference." International Conference on Machine Learning, 2024.](https://mlanthology.org/icml/2024/tang2024icml-quest/)

BibTeX

@inproceedings{tang2024icml-quest,
  title     = {{QUEST: Query-Aware Sparsity for Efficient Long-Context LLM Inference}},
  author    = {Tang, Jiaming and Zhao, Yilong and Zhu, Kan and Xiao, Guangxuan and Kasikci, Baris and Han, Song},
  booktitle = {International Conference on Machine Learning},
  year      = {2024},
  pages     = {47901--47911},
  volume    = {235},
  url       = {https://mlanthology.org/icml/2024/tang2024icml-quest/}
}