Bifurcated Attention for Single-Context Large-Batch Sampling

Abstract

We present bifurcated attention, a method for language model inference in single-context batch sampling settings. The approach reduces redundant memory IO, a significant contributor to latency at high batch sizes and long context lengths. Bifurcated attention achieves this by dividing the attention computation during incremental decoding into two separate GEMM operations: one over the KV cache from the prefill, which is shared across the batch, and one over the KV cache produced during decoding. The method computes exactly the same result as standard attention and retains the same computational load (FLOPs), but with reduced memory IO. Bifurcated attention is also compatible with multi-query attention, itself known for reducing KV-cache memory IO, further enabling higher batch sizes and context lengths. The resulting efficiency lowers latency and improves suitability for real-time applications, e.g., enabling massively parallel answer generation without substantially increasing latency, which enhances performance when combined with post-processing techniques such as reranking.
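As an illustration of the split described in the abstract, the following NumPy sketch computes one incremental-decoding attention step with the prefill KV cache stored once and shared across the batch, while each sample keeps only its own decoded KV cache. The function names, shapes, and single-head setup are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bifurcated_attention(q, k_ctx, v_ctx, k_dec, v_dec):
    """One decoding step for b samples sharing a single prefill context.

    q:      (b, d)      current-step query per sample
    k_ctx:  (m, d)      shared prefill keys (stored once, not per sample)
    v_ctx:  (m, d)      shared prefill values
    k_dec:  (b, t, d)   per-sample keys generated during decoding
    v_dec:  (b, t, d)   per-sample values generated during decoding
    """
    d = q.shape[-1]
    # GEMM 1: all queries attend to the one shared context KV cache,
    # so the context is read from memory once rather than b times.
    s_ctx = q @ k_ctx.T / np.sqrt(d)                         # (b, m)
    # GEMM 2: each query attends to its own decoded KV cache.
    s_dec = np.einsum('bd,btd->bt', q, k_dec) / np.sqrt(d)   # (b, t)
    # Joint softmax over context + decoded positions, then the
    # value aggregation split the same way.
    w = softmax(np.concatenate([s_ctx, s_dec], axis=-1))     # (b, m+t)
    w_ctx, w_dec = w[:, :k_ctx.shape[0]], w[:, k_ctx.shape[0]:]
    return w_ctx @ v_ctx + np.einsum('bt,btd->bd', w_dec, v_dec)
```

Because the softmax is taken jointly over both score blocks, the result matches standard attention over the concatenated KV cache exactly; only the memory layout and IO pattern differ.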

Cite

Text

Athiwaratkun et al. "Bifurcated Attention for Single-Context Large-Batch Sampling." International Conference on Machine Learning, 2024.

Markdown

[Athiwaratkun et al. "Bifurcated Attention for Single-Context Large-Batch Sampling." International Conference on Machine Learning, 2024.](https://mlanthology.org/icml/2024/athiwaratkun2024icml-bifurcated/)

BibTeX

@inproceedings{athiwaratkun2024icml-bifurcated,
  title     = {{Bifurcated Attention for Single-Context Large-Batch Sampling}},
  author    = {Athiwaratkun, Ben and Gonugondla, Sujan Kumar and Gouda, Sanjay Krishna and Qian, Haifeng and Ding, Hantian and Sun, Qing and Wang, Jun and Guo, Jiacheng and Chen, Liangfu and Bhatia, Parminder and Nallapati, Ramesh and Sengupta, Sudipta and Xiang, Bing},
  booktitle = {International Conference on Machine Learning},
  year      = {2024},
  pages     = {1971-1991},
  volume    = {235},
  url       = {https://mlanthology.org/icml/2024/athiwaratkun2024icml-bifurcated/}
}