MAPLE: Memory-Aware Predict and Load for Efficient LLM Inference
Abstract
Large Language Models (LLMs) perform well across a wide range of natural language processing (NLP) tasks. However, inference for long text generation is challenged by the significant memory demands of the key-value (KV) cache, which grows with sequence length. In this paper, we introduce a novel, bandwidth-efficient method for managing the KV cache. Using learning-based techniques, our method predicts and retrieves only the essential KV entries, eliminating the need to transfer all KV pairs. Unlike previous approaches, our method decouples the prediction phase from the computation phase by storing low-rank Keys in HBM, drastically reducing bandwidth consumption while adding little memory overhead and preserving accuracy.
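The abstract describes a predict-then-load pipeline: a compact, low-rank copy of the Keys kept in HBM is used to estimate which KV entries the current query actually needs, and only those full entries are transferred for exact attention. The sketch below is a minimal NumPy illustration of that general idea, not the authors' implementation; the function name, shapes, the rank `r`, the random projection, and the top-`k` budget are all illustrative assumptions rather than details from the paper.

```python
# Minimal sketch (assumptions, not the paper's method): predict relevant KV entries
# with a low-rank Key copy, then load and attend over only those entries.
import numpy as np

def low_rank_predict_and_load(q, K_low, P, K_full, V_full, k=32):
    """
    q      : (d,)   current query vector
    K_low  : (n, r) low-rank projected Keys kept in fast memory (r << d)
    P      : (d, r) projection used to build K_low (K_low = K_full @ P)
    K_full : (n, d) full Keys (conceptually in slower, larger memory)
    V_full : (n, d) full Values
    k      : number of KV entries to actually load
    """
    # 1) Prediction phase: cheap approximate scores using only the low-rank Keys.
    approx_scores = K_low @ (P.T @ q)                # (n,)
    top_idx = np.argpartition(-approx_scores, k)[:k]

    # 2) Load phase: fetch only the predicted entries (stand-in for the costly transfer).
    K_sel, V_sel = K_full[top_idx], V_full[top_idx]

    # 3) Computation phase: exact attention over the selected subset.
    scores = K_sel @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V_sel

# Toy usage with random data and a hypothetical random projection.
n, d, r = 4096, 128, 16
rng = np.random.default_rng(0)
K_full = rng.standard_normal((n, d))
V_full = rng.standard_normal((n, d))
P = rng.standard_normal((d, r)) / np.sqrt(d)
K_low = K_full @ P
out = low_rank_predict_and_load(rng.standard_normal(d), K_low, P, K_full, V_full)
```

Because the approximate scores touch only an `n × r` matrix, the prediction step is cheap enough to run separately from the main attention computation, which is what lets the full KV pairs stay out of the bandwidth-critical path.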
Cite
Text
Liu et al. "MAPLE: Memory-Aware Predict and Load for Efficient LLM Inference." NeurIPS 2024 Workshops: Compression, 2024.

Markdown
[Liu et al. "MAPLE: Memory-Aware Predict and Load for Efficient LLM Inference." NeurIPS 2024 Workshops: Compression, 2024.](https://mlanthology.org/neuripsw/2024/liu2024neuripsw-maple/)

BibTeX
@inproceedings{liu2024neuripsw-maple,
title = {{MAPLE: Memory-Aware Predict and Load for Efficient LLM Inference}},
author = {Liu, Zhenyu and Zhang, Zhemin and Zhang, Zirui and Qin, Yanyuan and Luo, Jiayi and Gu, Zhenyu and Liu, Liu},
booktitle = {NeurIPS 2024 Workshops: Compression},
year = {2024},
url = {https://mlanthology.org/neuripsw/2024/liu2024neuripsw-maple/}
}