Mini-Sequence Transformers: Optimizing Intermediate Memory for Long Sequences Training

Abstract

We introduce Mini-Sequence Transformer (MsT), a simple and effective methodology for highly efficient and accurate LLM training with extremely long sequences. MsT partitions input sequences and iteratively processes mini-sequences to reduce intermediate memory usage. Integrated with activation recomputation, it enables significant memory savings in both the forward and backward passes. In experiments with the Llama3-8B model, MsT shows no degradation in throughput or convergence even with sequences 12x longer than standard implementations. MsT is fully general, implementation-agnostic, and requires minimal code changes to integrate with existing LLM training frameworks. Integrated with the Hugging Face library, MsT successfully extends the maximum context length of Qwen, Mistral, and Gemma-2 by 12-24x.
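
The sketch below illustrates the mini-sequence idea on a transformer MLP block in plain PyTorch; the class name MiniSeqMLP and the num_mini_seq parameter are illustrative assumptions, not the authors' released code. The sequence dimension is split into chunks so the large intermediate activation is only ever materialized for one mini-sequence, and each chunk is wrapped in activation checkpointing so its intermediates are recomputed rather than stored for the backward pass.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class MiniSeqMLP(nn.Module):
    """Feed-forward block that processes the sequence in mini-sequences so the
    large intermediate activation never exists for the full sequence at once.
    Illustrative sketch only, not the official MsT implementation."""

    def __init__(self, hidden_size: int, intermediate_size: int, num_mini_seq: int = 4):
        super().__init__()
        self.up = nn.Linear(hidden_size, intermediate_size)
        self.act = nn.GELU()
        self.down = nn.Linear(intermediate_size, hidden_size)
        self.num_mini_seq = num_mini_seq

    def _mlp(self, chunk: torch.Tensor) -> torch.Tensor:
        # Intermediate tensor here has shape (batch, chunk_len, intermediate_size),
        # a num_mini_seq-fold reduction versus processing the full sequence.
        return self.down(self.act(self.up(chunk)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden_size); split along the sequence dimension.
        outputs = []
        for chunk in torch.chunk(x, self.num_mini_seq, dim=1):
            # Checkpointing recomputes this chunk's intermediate activation in
            # the backward pass instead of keeping it resident in memory.
            outputs.append(checkpoint(self._mlp, chunk, use_reentrant=False))
        return torch.cat(outputs, dim=1)

The same chunking pattern applies to the LM head, where the logits tensor over a large vocabulary dominates intermediate memory; in both cases the per-token computation is unchanged, so outputs match the standard implementation.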

Cite

Text

Luo et al. "Mini-Sequence Transformers: Optimizing Intermediate Memory for Long Sequences Training." Neural Information Processing Systems, 2024. doi:10.52202/079017-3086

Markdown

[Luo et al. "Mini-Sequence Transformers: Optimizing Intermediate Memory for Long Sequences Training." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/luo2024neurips-minisequence/) doi:10.52202/079017-3086

BibTeX

@inproceedings{luo2024neurips-minisequence,
  title     = {{Mini-Sequence Transformers: Optimizing Intermediate Memory for Long Sequences Training}},
  author    = {Luo, Cheng and Zhao, Jiawei and Chen, Zhuoming and Chen, Beidi and Anandkumar, Anima},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-3086},
  url       = {https://mlanthology.org/neurips/2024/luo2024neurips-minisequence/}
}