SparQ Attention: Bandwidth-Efficient LLM Inference
Abstract
The computational difficulties of large language model (LLM) inference remain a significant obstacle to their widespread deployment, with long input sequences and large batches causing token-generation to be bottlenecked by data-transfer. For this reason, we introduce **SparQ Attention**, a technique for increasing LLM inference throughput by utilising memory bandwidth more efficiently within attention layers, through selective fetching of the cached history. Our proposed technique can be applied directly to off-the-shelf LLMs during inference, without requiring any modification to the pre-training setup or additional fine-tuning. By evaluating Llama 2, Mistral and Pythia models on a wide range of downstream tasks, we show that SparQ Attention brings up to $8\times$ savings in attention data-transfers without substantial drops in accuracy.
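To make the "selective fetching of the cached history" idea concrete, here is a minimal single-head, single-query sketch of one way such selective fetching can work: cheaply approximate the attention scores using only the largest-magnitude query components, then transfer full key/value rows only for the top-scoring cache positions. This is an illustrative sketch under stated assumptions, not the authors' exact algorithm; the function name, the parameters `r` and `k`, and the score scaling are assumptions for illustration, and refinements such as score rescaling or fallback handling of unselected positions are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def selective_fetch_attention(q, K, V, r=16, k=64):
    """Illustrative bandwidth-reduced attention for one decode step.

    q: (d,) current query; K, V: (S, d) cached keys/values.
    Instead of reading the full (S, d) K and V, this reads roughly
    S*r elements of K plus 2*k*d elements of K/V.
    """
    S, d = K.shape
    r = min(r, d)
    k = min(k, S)

    # Step 1: rank cache positions using only the r largest-magnitude
    # components of the query (and the matching r columns of K).
    idx_r = np.argsort(-np.abs(q))[:r]
    approx_scores = softmax(q[idx_r] @ K[:, idx_r].T / np.sqrt(d))

    # Step 2: fetch full key/value rows only for the top-k positions
    # and run exact attention over that subset.
    idx_k = np.argsort(-approx_scores)[:k]
    scores = softmax(q @ K[idx_k].T / np.sqrt(d))
    return scores @ V[idx_k]


# Example: 4096-token cache, head dimension 128 (shapes are illustrative).
rng = np.random.default_rng(0)
q = rng.standard_normal(128)
K = rng.standard_normal((4096, 128))
V = rng.standard_normal((4096, 128))
out = selective_fetch_attention(q, K, V, r=16, k=128)
```

In this sketch the dominant data-transfer cost drops from reading all of K and V (2·S·d elements) to reading r columns of K plus 2·k·d selected rows, which is where the bandwidth saving comes from when k ≪ S.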
Cite
Text
Ribar et al. "SparQ Attention: Bandwidth-Efficient LLM Inference." ICLR 2024 Workshops: PML4LRS, 2024.
Markdown
[Ribar et al. "SparQ Attention: Bandwidth-Efficient LLM Inference." ICLR 2024 Workshops: PML4LRS, 2024.](https://mlanthology.org/iclrw/2024/ribar2024iclrw-sparq/)
BibTeX
@inproceedings{ribar2024iclrw-sparq,
title = {{SparQ Attention: Bandwidth-Efficient LLM Inference}},
author = {Ribar, Luka and Chelombiev, Ivan and Hudlass-Galley, Luke and Blake, Charlie and Luschi, Carlo and Orr, Douglas},
booktitle = {ICLR 2024 Workshops: PML4LRS},
year = {2024},
url = {https://mlanthology.org/iclrw/2024/ribar2024iclrw-sparq/}
}