On IO-Efficient Attention Mechanisms: Context-Aware Bifurcated Attention and the Generalized Multi-Group Attention
Abstract
Multi-query attention, a method that compresses all heads in the key and value tensors into a single head, is known to improve inference efficiency by shrinking the key-value cache, allowing incremental decoding at high batch sizes and long context lengths. However, questions arise regarding how such compression affects performance compared to traditional multi-head attention. In this paper, we investigate the scaling laws and performance of multi-query versus multi-head attention mechanisms, including a generalized multi-group attention that enables varying degrees of key-value compression. Our study reveals that each attention family exhibits smooth and consistent performance scaling as model size increases, with higher compression corresponding to lower performance and an upward shift in the validation-loss-versus-size scaling curves. This finding implies that a multi-query model of comparable performance must be slightly larger; we therefore present a comprehensive comparison of multi-head and multi-query models in terms of their latency tradeoffs, and find that in high-workload scenarios the larger multi-query model can still be much more efficient. Additionally, we propose a novel context-aware bifurcated attention for single-context batch sampling that substantially reduces memory IO, especially at high batch sizes and context lengths. Bifurcated attention is an exact computation technique that divides any attention operation into a context component and a decoding component. Even though bifurcated attention uses the same FLOPs as the original attention, it avoids redundant memory loading, resulting in much lower latency and making multiple real-time recommendations available at little extra latency cost.
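The abstract describes bifurcated attention as splitting the attention computation over a shared context and the per-sequence decoded tokens, so that the shared context KV cache is loaded from memory only once per decoding step. Below is a minimal sketch of how such a split can be computed exactly for one single-token decoding step; the tensor names, shapes, and use of PyTorch einsum are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of context-aware bifurcated attention (illustrative only).
# Assumes single-context batch sampling: every sequence in the batch shares one
# prompt, so the context KV cache is stored once rather than per sequence.
import torch

def bifurcated_attention(q, k_ctx, v_ctx, k_dec, v_dec):
    """Exact attention for one decoding step.

    q:     (batch, heads, d)           query for the current token
    k_ctx: (heads, ctx_len, d)         shared context keys (stored once)
    v_ctx: (heads, ctx_len, d)         shared context values
    k_dec: (batch, heads, dec_len, d)  per-sequence decoded keys
    v_dec: (batch, heads, dec_len, d)  per-sequence decoded values
    """
    scale = q.shape[-1] ** -0.5
    # Context part: the shared KV cache is read once for the whole batch,
    # instead of being replicated (and re-loaded) per batch element.
    logits_ctx = torch.einsum('bhd,hcd->bhc', q, k_ctx) * scale
    # Decoding part: each sequence attends to its own generated tokens.
    logits_dec = torch.einsum('bhd,bhtd->bht', q, k_dec) * scale
    # Softmax over the full (context + decoded) length keeps the result exact,
    # matching standard attention FLOP-for-FLOP.
    weights = torch.softmax(torch.cat([logits_ctx, logits_dec], dim=-1), dim=-1)
    w_ctx, w_dec = weights.split([k_ctx.shape[1], k_dec.shape[2]], dim=-1)
    return (torch.einsum('bhc,hcd->bhd', w_ctx, v_ctx)
            + torch.einsum('bht,bhtd->bhd', w_dec, v_dec))
```

Because the softmax is taken over the concatenated context and decoding logits, this produces the same output as ordinary attention over the full sequence; the savings come purely from avoiding redundant memory IO on the shared context cache.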
Cite
Text
Athiwaratkun et al. "On IO-Efficient Attention Mechanisms: Context-Aware Bifurcated Attention and the Generalized Multi-Group Attention." ICML 2023 Workshops: ES-FoMO, 2023.
Markdown
[Athiwaratkun et al. "On IO-Efficient Attention Mechanisms: Context-Aware Bifurcated Attention and the Generalized Multi-Group Attention." ICML 2023 Workshops: ES-FoMO, 2023.](https://mlanthology.org/icmlw/2023/athiwaratkun2023icmlw-ioefficient/)
BibTeX
@inproceedings{athiwaratkun2023icmlw-ioefficient,
title = {{On IO-Efficient Attention Mechanisms: Context-Aware Bifurcated Attention and the Generalized Multi-Group Attention}},
author = {Athiwaratkun, Ben and Gonugondla, Sujan Kumar and Gouda, Sanjay Krishna and Qian, Haifeng and Ding, Hantian and Sun, Qing and Wang, Jun and Chen, Liangfu and Guo, Jiacheng and Bhatia, Parminder and Nallapati, Ramesh and Sengupta, Sudipta and Xiang, Bing},
booktitle = {ICML 2023 Workshops: ES-FoMO},
year = {2023},
url = {https://mlanthology.org/icmlw/2023/athiwaratkun2023icmlw-ioefficient/}
}