The Fine-Grained Complexity of Gradient Computation for Training Large Language Models

Abstract

Large language models (LLMs) have made fundamental contributions over the last few years. To train an LLM, one alternately runs 'forward' computations and 'backward' computations. The forward computation can be viewed as attention function evaluation, and the backward computation can be viewed as a gradient computation. Previous work by [Alman and Song, NeurIPS 2023] proved that the forward step can be performed in almost-linear time in certain parameter regimes, but that there is no truly sub-quadratic time algorithm in the remaining parameter regimes unless the popular hypothesis $\mathsf{SETH}$ is false. In this work, we show nearly identical results for the harder-seeming problem of computing the gradient of the loss function of a one-layer attention network, and thus for the entire process of LLM training. This completely characterizes the fine-grained complexity of every step of LLM training.
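
To make the forward/backward distinction in the abstract concrete, below is a minimal NumPy sketch of a one-layer softmax attention evaluation (the 'forward' step) and a finite-difference gradient of a squared loss (standing in for the 'backward' step). This is only an illustration under assumptions: the single merged weight matrix W, the squared loss, and all function names are ours rather than the paper's exact formulation, and the naive evaluation here takes the quadratic time whose improvability the paper characterizes.

import numpy as np

def attention_layer(X, W):
    # One-layer attention of the form D^{-1} exp(X W X^T) X, where W plays
    # the role of a merged query-key weight matrix (an illustrative choice).
    A = np.exp(X @ W @ X.T)              # n x n unnormalized attention scores
    D = A.sum(axis=1, keepdims=True)     # softmax row normalizers (diagonal of D)
    return (A / D) @ X                   # n x d attention output

def loss(X, Y, W):
    # Squared loss against a target Y; evaluating it is the 'forward' computation.
    return 0.5 * np.sum((attention_layer(X, W) - Y) ** 2)

def grad_loss(X, Y, W, eps=1e-6):
    # Finite-difference gradient with respect to W; this stands in for the
    # 'backward' computation and costs one forward pass per perturbed entry.
    G = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            Wp, Wm = W.copy(), W.copy()
            Wp[i, j] += eps
            Wm[i, j] -= eps
            G[i, j] = (loss(X, Y, Wp) - loss(X, Y, Wm)) / (2 * eps)
    return G

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 8, 4
    X = rng.standard_normal((n, d))
    Y = rng.standard_normal((n, d))
    W = 0.1 * rng.standard_normal((d, d))
    print(grad_loss(X, Y, W))

Note that both the forward pass and each gradient entry above materialize the n x n attention matrix, which is the quadratic bottleneck; the paper's question is when this can be avoided for the gradient, matching the known answer for the forward pass.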

Cite

Text

Alman and Song. "The Fine-Grained Complexity of Gradient Computation for Training Large Language Models." Neural Information Processing Systems, 2024. doi:10.52202/079017-1963

Markdown

[Alman and Song. "The Fine-Grained Complexity of Gradient Computation for Training Large Language Models." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/alman2024neurips-finegrained/) doi:10.52202/079017-1963

BibTeX

@inproceedings{alman2024neurips-finegrained,
  title     = {{The Fine-Grained Complexity of Gradient Computation for Training Large Language Models}},
  author    = {Alman, Josh and Song, Zhao},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-1963},
  url       = {https://mlanthology.org/neurips/2024/alman2024neurips-finegrained/}
}