Extracting Parallelism from Large Language Model Queries

Abstract

Optimization engines for LLM query serving typically focus on workloads with known structure, treating the query itself as a black box. In this work, we investigate extracting parallelization opportunities from individual queries that have decomposable subtasks. Using the LMSYS-Chat-1M dataset, we identify three query categories that are amenable to decomposition into parallel LLM calls, and curate a dataset of these queries as a benchmark for this type of within-query parallelization. We develop a prototype system to parallelize these queries and report initial performance results, showing that parallelization can yield a 5x speedup over serial execution while maintaining or even improving generation quality.
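To make the idea concrete, here is a minimal sketch of within-query parallelization, assuming the query's subtasks are independent. The call_llm helper and the naive semicolon-based decompose function are illustrative placeholders, not the paper's prototype system; real decomposition might itself use an LLM to identify subtasks.

import asyncio

async def call_llm(prompt: str) -> str:
    """Placeholder for a real async LLM API call."""
    await asyncio.sleep(0.1)  # simulate generation latency
    return f"answer to: {prompt!r}"

def decompose(query: str) -> list[str]:
    """Split a decomposable query into independent subtask prompts.
    This naive split is purely illustrative."""
    return [part.strip() for part in query.split(";") if part.strip()]

async def answer_in_parallel(query: str) -> str:
    subtasks = decompose(query)
    # Issue one LLM call per subtask concurrently rather than serially,
    # so wall-clock latency is bounded by the slowest subtask.
    results = await asyncio.gather(*(call_llm(t) for t in subtasks))
    return "\n".join(results)

if __name__ == "__main__":
    query = "summarize document A; summarize document B; summarize document C"
    print(asyncio.run(answer_in_parallel(query)))

In this sketch the serial cost of N subtasks collapses to roughly the cost of one, which is the source of the reported speedup when subtasks are genuinely independent.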

Cite

Text

Kolawole et al. "Extracting Parallelism from Large Language Model Queries." NeurIPS 2024 Workshops: AFM, 2024.

Markdown

[Kolawole et al. "Extracting Parallelism from Large Language Model Queries." NeurIPS 2024 Workshops: AFM, 2024.](https://mlanthology.org/neuripsw/2024/kolawole2024neuripsw-extracting/)

BibTeX

@inproceedings{kolawole2024neuripsw-extracting,
  title     = {{Extracting Parallelism from Large Language Model Queries}},
  author    = {Kolawole, Steven and Santhanam, Keshav and Smith, Virginia and Thaker, Pratiksha},
  booktitle = {NeurIPS 2024 Workshops: AFM},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/kolawole2024neuripsw-extracting/}
}