Exploring and Improving Drafts in Blockwise Parallel Decoding

Abstract

Blockwise parallel decoding (BPD) was proposed by Stern et al. (2018) as a method to speed up language model inference by predicting multiple future tokens at once, termed block drafts, which are then verified by the autoregressive model. Block drafts are generated by multiple independent prediction heads of a blockwise parallel language model. This paper contributes to the understanding and improvement of block drafts in two ways. First, we analyze the token distributions produced by the multiple prediction heads. Second, we leverage this analysis to develop algorithms that improve BPD inference speed by refining block drafts with n-gram and neural language models. Experiments demonstrate that refined block drafts yield a +5-21% increase in block efficiency (i.e., the number of tokens accepted from the block draft) across diverse datasets.
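To make the block-efficiency metric concrete, here is a minimal sketch of the greedy verification rule from Stern et al. (2018): the autoregressive model accepts the longest prefix of the draft that matches its own predictions, plus one token of its own. Token IDs and function names here are illustrative, not from the paper's code.

```python
def accepted_tokens(draft, verify):
    """Count tokens kept in one BPD step: the longest prefix of the
    block draft that agrees with the autoregressive model's predictions
    (verify), plus the model's own token at the first mismatch."""
    n = 0
    for d, v in zip(draft, verify):
        if d != v:
            break
        n += 1
    # At least one token is always emitted per step, since the
    # verifier's own prediction at the mismatch position is kept.
    return n + 1

# Toy example: the heads drafted 4 tokens; the model agrees on the
# first 2, so 3 tokens are emitted in this step.
assert accepted_tokens([5, 9, 3, 7], [5, 9, 8, 7]) == 3
```

Block efficiency is then the average of this count over decoding steps; the paper's refinement methods raise it by rescoring the heads' drafts with n-gram or neural language models before verification.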

Cite

Text

Kim et al. "Exploring and Improving Drafts in Blockwise Parallel Decoding." ICML 2024 Workshops: ES-FoMo-II, 2024.

Markdown

[Kim et al. "Exploring and Improving Drafts in Blockwise Parallel Decoding." ICML 2024 Workshops: ES-FoMo-II, 2024.](https://mlanthology.org/icmlw/2024/kim2024icmlw-exploring/)

BibTeX

@inproceedings{kim2024icmlw-exploring,
  title     = {{Exploring and Improving Drafts in Blockwise Parallel Decoding}},
  author    = {Kim, Taehyeon and Suresh, Ananda Theertha and Papineni, Kishore A and Riley, Michael and Kumar, Sanjiv and Benton, Adrian},
  booktitle = {ICML 2024 Workshops: ES-FoMo-II},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/kim2024icmlw-exploring/}
}