IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs
Abstract
One limitation of existing Transformer-based models is that they cannot handle very long sequences as input, since their self-attention operations exhibit quadratic time and space complexity. This problem becomes especially acute when Transformers are deployed on hardware platforms equipped only with CPUs. To address this issue, we propose a novel method for accelerating self-attention at inference time that works with pretrained Transformer models out-of-the-box, without requiring retraining. We apply our method to accelerate various long-sequence Transformers, including a leading LLaMA 2-based LLM, on a range of benchmarks and demonstrate a speedup of $2.73\times$-$7.63\times$ while retaining $98.6\%$-$99.6\%$ of the accuracy of the original pretrained models. The code is available on our project website at https://yuzhenmao.github.io/IceFormer/.
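For context on the quadratic cost the abstract refers to: in standard self-attention, every query attends to every key, so the score matrix has shape (n, n) for a sequence of length n. The sketch below is a minimal illustration of this dense baseline only; it is not IceFormer's accelerated method, and the function name, shapes, and sizes are hypothetical.

import numpy as np

def dense_self_attention(Q, K, V):
    """Standard scaled dot-product self-attention (the quadratic baseline).

    For sequence length n and head dimension d, the score matrix Q @ K.T
    has shape (n, n), so time and memory grow as O(n^2) in n -- the
    bottleneck that IceFormer targets at inference time.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (n, n): quadratic in n
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                              # (n, d) output

# Hypothetical sizes: doubling n quadruples the (n, n) score matrix.
n, d = 4096, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = dense_self_attention(Q, K, V)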
Cite
Text
Mao et al. "IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs." International Conference on Learning Representations, 2024.
Markdown
[Mao et al. "IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs." International Conference on Learning Representations, 2024.](https://mlanthology.org/iclr/2024/mao2024iclr-iceformer/)
BibTeX
@inproceedings{mao2024iclr-iceformer,
title = {{IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs}},
author = {Mao, Yuzhen and Ester, Martin and Li, Ke},
booktitle = {International Conference on Learning Representations},
year = {2024},
url = {https://mlanthology.org/iclr/2024/mao2024iclr-iceformer/}
}