Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models

Abstract

During inference for transformer-based LLMs, prefilling computes the key-value (KV) cache for prompt input tokens before autoregressive generation. This work highlights a pitfall of prefilling: for batches containing prompts of highly varying lengths, significant computation is wasted by the standard practice of padding all sequences to the maximum length. As LLMs support longer context lengths, variations in prompt lengths within a batch become more pronounced. To address this, we propose Prepacking, a simple yet effective method to optimize prefilling computation. Prepacking combines prompts of varying lengths into a single sequence and packs multiple such sequences into a compact batch using a bin-packing algorithm, then modifies the attention mask and positional encoding so that multiple prefilled KV caches are computed within a single sequence. On standard datasets with varying prompt lengths, our method significantly improves speed and memory efficiency compared to default padding-based prefilling in Huggingface across various model configurations and inference scenarios.
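The recipe sketched in the abstract can be illustrated with a short, self-contained PyTorch snippet: prompts are greedily bin-packed into sequences up to a capacity limit, position ids restart at zero for each packed prompt, and a block-diagonal causal mask prevents packed prompts from attending to one another. This is a minimal sketch for intuition only, not the authors' implementation; the helper names (`first_fit_decreasing`, `build_packed_inputs`), the first-fit-decreasing heuristic, and the padding convention are assumptions made for illustration.

```python
# Minimal, illustrative sketch of prompt prepacking (not the paper's reference code).
from typing import List, Tuple
import torch


def first_fit_decreasing(prompts: List[List[int]], max_len: int) -> List[List[List[int]]]:
    """Greedily bin-pack prompts (longest first) into bins of total length <= max_len."""
    bins: List[List[List[int]]] = []
    for prompt in sorted(prompts, key=len, reverse=True):
        for b in bins:
            if sum(len(p) for p in b) + len(prompt) <= max_len:
                b.append(prompt)
                break
        else:
            bins.append([prompt])
    return bins


def build_packed_inputs(bin_: List[List[int]], pad_id: int, max_len: int
                        ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """Concatenate one bin's prompts; build restart position ids and a
    block-diagonal causal attention mask so prompts cannot attend across boundaries."""
    input_ids, position_ids, segment_ids = [], [], []
    for seg, prompt in enumerate(bin_):
        input_ids.extend(prompt)
        position_ids.extend(range(len(prompt)))   # positions restart for each prompt
        segment_ids.extend([seg] * len(prompt))

    # Pad the packed sequence to a fixed length (far less padding than per-prompt padding).
    pad = max_len - len(input_ids)
    input_ids += [pad_id] * pad
    position_ids += [0] * pad
    segment_ids += [-1] * pad                      # -1 marks padding positions

    seg = torch.tensor(segment_ids)
    causal = torch.tril(torch.ones(max_len, max_len, dtype=torch.bool))
    same_segment = (seg[:, None] == seg[None, :]) & (seg[:, None] >= 0)
    attention_mask = causal & same_segment         # block-diagonal causal mask

    return torch.tensor(input_ids), torch.tensor(position_ids), attention_mask


# Example: three short prompts packed into capacity-8 sequences.
prompts = [[101, 7, 8], [101, 5], [101, 9, 9, 9, 9]]
for packed in first_fit_decreasing(prompts, max_len=8):
    ids, pos, mask = build_packed_inputs(packed, pad_id=0, max_len=8)
```

With inputs shaped this way, a single prefill forward pass produces the KV cache entries for every prompt in the packed sequence; the per-prompt caches can then be sliced out by segment for subsequent autoregressive decoding.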

Cite

Text

Zhao et al. "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models." ICML 2024 Workshops: ES-FoMo-II, 2024.

Markdown

[Zhao et al. "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models." ICML 2024 Workshops: ES-FoMo-II, 2024.](https://mlanthology.org/icmlw/2024/zhao2024icmlw-prepacking/)

BibTeX

@inproceedings{zhao2024icmlw-prepacking,
  title     = {{Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models}},
  author    = {Zhao, Siyan and Israel, Daniel Mingyi and Van den Broeck, Guy and Grover, Aditya},
  booktitle = {ICML 2024 Workshops: ES-FoMo-II},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/zhao2024icmlw-prepacking/}
}