Optimised Grouped-Query Attention Mechanism for Transformers

Abstract

Grouped-query attention (GQA) has been widely adopted in LLMs to mitigate the complexity of multi-head attention (MHA). To convert an MHA into a GQA, neighbouring query heads of the MHA are evenly split into groups, and each group shares a single key and value projection. In this work, we propose AsymGQA, an activation-informed approach that groups an MHA into a GQA asymmetrically for better model performance. AsymGQA outperforms standard GQA within the same model-size budget; for example, AsymGQA LLaMA-2-7B improves accuracy on MMLU by 7.5% over neighbour grouping. Our approach addresses GQA's trade-off between model performance and hardware efficiency.
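To make the neighbour-grouping baseline concrete, below is a minimal sketch of converting per-head key (or value) projection weights of an MHA into shared GQA projections by mean-pooling neighbouring heads, the common construction from the GQA literature. The function name `neighbour_group_kv`, the tensor shapes, and the mean-pooling choice are illustrative assumptions, not the paper's AsymGQA method, which instead groups heads asymmetrically based on activations.

```python
# Minimal sketch (assumption): neighbour grouping of key/value projection
# weights for an MHA -> GQA conversion via mean-pooling within each group.
# This illustrates the baseline that AsymGQA improves upon, not AsymGQA itself.
import torch

def neighbour_group_kv(w_kv: torch.Tensor, num_heads: int, num_groups: int) -> torch.Tensor:
    """Mean-pool the key (or value) projection weights of neighbouring heads.

    w_kv: [num_heads * head_dim, hidden_dim] per-head projection rows, stacked.
    Returns: [num_groups * head_dim, hidden_dim], one shared projection per group.
    """
    assert num_heads % num_groups == 0, "heads must split evenly into groups"
    head_dim = w_kv.shape[0] // num_heads
    hidden_dim = w_kv.shape[1]
    # Reshape to [num_groups, heads_per_group, head_dim, hidden_dim] so that
    # consecutive (neighbouring) heads fall into the same group.
    per_head = w_kv.view(num_groups, num_heads // num_groups, head_dim, hidden_dim)
    # Each group of neighbouring heads shares the mean of its members' weights.
    return per_head.mean(dim=1).reshape(num_groups * head_dim, hidden_dim)

# Example: 32 key heads of dimension 128 merged into 8 shared key heads.
w_k = torch.randn(32 * 128, 4096)
w_k_gqa = neighbour_group_kv(w_k, num_heads=32, num_groups=8)
print(w_k_gqa.shape)  # torch.Size([1024, 4096])
```

In this baseline the grouping is fixed and symmetric (equal-sized groups of adjacent heads); AsymGQA's contribution is to choose the grouping asymmetrically using activation information while keeping the same number of shared key/value heads.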

Cite

Text

Chen et al. "Optimised Grouped-Query Attention Mechanism for Transformers." ICML 2024 Workshops: ES-FoMo-II, 2024.

Markdown

[Chen et al. "Optimised Grouped-Query Attention Mechanism for Transformers." ICML 2024 Workshops: ES-FoMo-II, 2024.](https://mlanthology.org/icmlw/2024/chen2024icmlw-optimised/)

BibTeX

@inproceedings{chen2024icmlw-optimised,
  title     = {{Optimised Grouped-Query Attention Mechanism for Transformers}},
  author    = {Chen, Yuang and Zhang, Cheng and Gao, Xitong and Mullins, Robert D. and Constantinides, George Anthony and Zhao, Yiren},
  booktitle = {ICML 2024 Workshops: ES-FoMo-II},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/chen2024icmlw-optimised/}
}