Expressing and Exploiting Parallelism in Language Model Decoding

Abstract

For autoregressive language models, decoding is inherently sequential: tokens are generated one after another. Recent attempts to introduce parallelism require a pre-determined structure for parallel generation, such as generating an outline and dividing the response into parallel sub-tasks. In this work, we explore a new technique that automates parallel generation by dynamically exploiting parallel structure in the semantics of the language model's response. Specifically, we introduce MSG, a simple annotation language that allows language models to express parallelism in their outputs. We then develop an interpreter for MSG that performs on-the-fly parallel generation during decoding, exploiting the parallelism expressed in the MSG-annotated outputs. We demonstrate that our approach improves tokens generated per second by 21% while maintaining the same output quality.
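
To make the idea concrete, below is a minimal sketch of the kind of interpreter the abstract describes. It assumes a hypothetical marker format in which the model emits <par> blocks whose <branch> entries are semantically independent; these tags, the decode_branch stub, and the interpret function are illustrative inventions, not the paper's actual MSG syntax or implementation.

import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical markers standing in for MSG annotations.
PAR = re.compile(r"<par>(.*?)</par>", re.DOTALL)
BRANCH = re.compile(r"<branch>(.*?)</branch>", re.DOTALL)

def decode_branch(prompt: str) -> str:
    # Stand-in for a call to an autoregressive decoder that completes one branch.
    return f"[completion for: {prompt.strip()}]"

def interpret(annotated: str) -> str:
    # Expand each <par> block by decoding its branches concurrently,
    # then splice the completions back into the surrounding text.
    def expand(match: re.Match) -> str:
        prompts = BRANCH.findall(match.group(1))
        with ThreadPoolExecutor() as pool:
            completions = list(pool.map(decode_branch, prompts))
        return "\n".join(completions)
    return PAR.sub(expand, annotated)

if __name__ == "__main__":
    annotated = (
        "Overview of the answer.\n"
        "<par><branch>Explain sub-task A</branch>"
        "<branch>Explain sub-task B</branch></par>\n"
        "Closing summary."
    )
    print(interpret(annotated))

In a real decoder, each branch would be produced by a separate batched or concurrent decoding call rather than a string stub; running those calls concurrently is where the reported throughput gain would come from.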

Cite

Text

Jin et al. "Expressing and Exploiting Parallelism in Language Model Decoding." ICLR 2024 Workshops: LLMAgents, 2024.

Markdown

[Jin et al. "Expressing and Exploiting Parallelism in Language Model Decoding." ICLR 2024 Workshops: LLMAgents, 2024.](https://mlanthology.org/iclrw/2024/jin2024iclrw-expressing/)

BibTeX

@inproceedings{jin2024iclrw-expressing,
  title     = {{Expressing and Exploiting Parallelism in Language Model Decoding}},
  author    = {Jin, Tian and Cheng, Ellie Y and Carbin, Michael},
  booktitle = {ICLR 2024 Workshops: LLMAgents},
  year      = {2024},
  url       = {https://mlanthology.org/iclrw/2024/jin2024iclrw-expressing/}
}