Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding
Abstract
Large language models (LLMs) are widely used for text generation, but their size and reliance on autoregressive decoding increase deployment costs and latency. We propose a hybrid approach that combines language models of different sizes to improve efficiency while maintaining performance. Our method uses a pretrained LLM to encode the prompt tokens in parallel; these representations then condition a small language model (SLM), which generates the response more efficiently. By combining encoder-decoder LLMs with encoder-decoder and decoder-only SLMs, we achieve up to a 4x speedup with only a minor performance penalty of 1-2% on translation and summarization tasks compared to the LLM.
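The sketch below illustrates the decoding scheme described in the abstract: the large model runs a single parallel pass over the prompt, and only the small model decodes autoregressively, conditioned on the projected prompt representations. This is a minimal, hypothetical rendering, not the authors' implementation; the module names, dimensions, the linear projector, and the way prompt embeddings are replaced are all assumptions made for illustration.

```python
# Minimal sketch of LLM-to-SLM decoding (illustrative stand-ins, not the paper's code).
import torch
import torch.nn as nn


class LLMToSLM(nn.Module):
    def __init__(self, vocab=32000, d_llm=2048, d_slm=512):
        super().__init__()
        # Stand-in for a pretrained LLM used only to encode the prompt.
        self.llm_embed = nn.Embedding(vocab, d_llm)
        self.llm_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_llm, nhead=8, batch_first=True), num_layers=2)
        # Hypothetical projector mapping LLM states into the SLM embedding space.
        self.project = nn.Linear(d_llm, d_slm)
        # Stand-in for a small decoder-only SLM that does the autoregressive work.
        self.slm_embed = nn.Embedding(vocab, d_slm)
        self.slm_decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_slm, nhead=8, batch_first=True), num_layers=2)
        self.lm_head = nn.Linear(d_slm, vocab)

    @torch.no_grad()
    def generate(self, prompt_ids, max_new_tokens=16):
        # 1) One parallel pass of the large model over the prompt tokens.
        prompt_repr = self.project(self.llm_encoder(self.llm_embed(prompt_ids)))
        out = prompt_ids
        # 2) Autoregressive decoding is performed only by the small model,
        #    conditioned on the LLM-derived prompt representations.
        for _ in range(max_new_tokens):
            tok_emb = self.slm_embed(out)
            tok_emb[:, : prompt_ids.size(1)] = prompt_repr  # inject LLM prompt encoding
            T = tok_emb.size(1)
            causal_mask = torch.full((T, T), float("-inf")).triu(1)
            logits = self.lm_head(self.slm_decoder(tok_emb, mask=causal_mask))
            next_tok = logits[:, -1].argmax(-1, keepdim=True)  # greedy decoding
            out = torch.cat([out, next_tok], dim=1)
        return out


model = LLMToSLM()
print(model.generate(torch.randint(0, 32000, (1, 5))).shape)  # (1, 5 + 16)
```

The speedup comes from the fact that the expensive LLM is invoked once, in parallel over the prompt, while every generated token requires only a forward pass through the much smaller SLM.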
Cite
Text
Bergner et al. "Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding." ICML 2024 Workshops: ES-FoMo-II, 2024.
Markdown
[Bergner et al. "Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding." ICML 2024 Workshops: ES-FoMo-II, 2024.](https://mlanthology.org/icmlw/2024/bergner2024icmlw-think/)
BibTeX
@inproceedings{bergner2024icmlw-think,
  title     = {{Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding}},
  author    = {Bergner, Benjamin and Skliar, Andrii and Royer, Amelie and Blankevoort, Tijmen and Asano, Yuki M and Bejnordi, Babak Ehteshami},
  booktitle = {ICML 2024 Workshops: ES-FoMo-II},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/bergner2024icmlw-think/}
}