EL-Attention: Memory Efficient Lossless Attention for Generation
Abstract
Transformer models with multi-head attention require caching intermediate results for efficient inference in generation tasks. However, the cache brings new memory-related costs and prevents leveraging a larger batch size for faster speed. We propose memory-efficient lossless attention (called EL-attention) to address this issue. It avoids the heavy operations for building multi-head keys and values, so no cache for them is needed. EL-attention constructs an ensemble of attention results by expanding the query while keeping the key and value shared. It produces the same result as multi-head attention with less GPU memory and faster inference speed. We conduct extensive experiments on Transformer, BART, and GPT-2 for summarization and question generation tasks. The results show EL-attention speeds up existing models by 1.6x to 5.3x without accuracy loss.
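The trick behind "expanding the query while keeping key and value shared" is an algebraic reordering: since (qW^Q)(XW^K)^T = (qW^Q(W^K)^T)X^T and (A · XW^V)W^O = (A · X)(W^V W^O), the key and value projections can be folded into the query side and the output side, so the raw hidden states X serve as shared key and value across heads. Below is a minimal PyTorch sketch of this equivalence for a single decoding step, not the authors' implementation: projection biases and the incremental-decoding cache are omitted, and all names are illustrative.

```python
import torch

def multi_head_attention(q, x, Wq, Wk, Wv, Wo, n_heads):
    """Standard multi-head attention for one query step.
    q: (1, d_model) current query; x: (src_len, d_model) hidden states.
    Wq, Wk, Wv, Wo: (d_model, d_model). Biases omitted for brevity."""
    d_model = q.shape[-1]
    d_head = d_model // n_heads
    Q = (q @ Wq).view(1, n_heads, d_head).transpose(0, 1)        # (h, 1, d_head)
    K = (x @ Wk).view(-1, n_heads, d_head).transpose(0, 1)       # (h, src, d_head)
    V = (x @ Wv).view(-1, n_heads, d_head).transpose(0, 1)       # (h, src, d_head)
    attn = torch.softmax(Q @ K.transpose(1, 2) / d_head ** 0.5, dim=-1)
    out = (attn @ V).transpose(0, 1).reshape(1, d_model)         # concat heads
    return out @ Wo

def el_attention(q, x, Wq, Wk, Wv, Wo, n_heads):
    """EL-attention sketch: expand the query per head, share raw x as key/value.
    No multi-head K/V tensors are built, so nothing needs to be cached for them."""
    d_model = q.shape[-1]
    d_head = d_model // n_heads
    Wq_h = Wq.view(d_model, n_heads, d_head).permute(1, 0, 2)    # (h, d_model, d_head)
    Wk_h = Wk.view(d_model, n_heads, d_head).permute(1, 0, 2)
    Wv_h = Wv.view(d_model, n_heads, d_head).permute(1, 0, 2)
    Wo_h = Wo.view(n_heads, d_head, d_model)                     # per-head rows of Wo
    # Fold the key projection into the query: (q Wq_i) Wk_i^T
    Q_exp = (q @ Wq_h) @ Wk_h.transpose(1, 2)                    # (h, 1, d_model)
    attn = torch.softmax(Q_exp @ x.T / d_head ** 0.5, dim=-1)    # scores against raw x
    # Fold the value and output projections after attending to raw x
    head_out = (attn @ x) @ (Wv_h @ Wo_h)                        # (h, 1, d_model)
    return head_out.sum(dim=0)                                   # sum over heads

# Both paths give the same result (up to floating-point error):
torch.manual_seed(0)
d_model, n_heads, src_len = 64, 8, 10
q = torch.randn(1, d_model)
x = torch.randn(src_len, d_model)
Wq, Wk, Wv, Wo = (torch.randn(d_model, d_model) for _ in range(4))
assert torch.allclose(multi_head_attention(q, x, Wq, Wk, Wv, Wo, n_heads),
                      el_attention(q, x, Wq, Wk, Wv, Wo, n_heads), atol=1e-4)
```

The memory saving in the sketch comes from never materializing or caching the per-head K and V tensors: only the shared hidden states x are attended over, at the cost of a slightly larger expanded query.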
Cite
Text
Yan et al. "EL-Attention: Memory Efficient Lossless Attention for Generation." International Conference on Machine Learning, 2021.
Markdown
[Yan et al. "EL-Attention: Memory Efficient Lossless Attention for Generation." International Conference on Machine Learning, 2021.](https://mlanthology.org/icml/2021/yan2021icml-elattention/)
BibTeX
@inproceedings{yan2021icml-elattention,
title = {{EL-Attention: Memory Efficient Lossless Attention for Generation}},
author = {Yan, Yu and Chen, Jiusheng and Qi, Weizhen and Bhendawade, Nikhil and Gong, Yeyun and Duan, Nan and Zhang, Ruofei},
booktitle = {International Conference on Machine Learning},
year = {2021},
pages = {11648--11658},
volume = {139},
url = {https://mlanthology.org/icml/2021/yan2021icml-elattention/}
}