MINI-SEQUENCE TRANSFORMER: Optimizing Intermediate Memory for Long Sequences Training
Abstract
We introduce MINI-SEQUENCE TRANSFORMER (MsT), a simple and effective methodology for highly efficient and accurate LLM training with extremely long sequences. MsT partitions input sequences and iteratively processes mini-sequences to reduce intermediate memory usage. Integrated with activation recomputation, it enables significant memory savings in both the forward and backward passes. In experiments with the Llama3-8B model, thanks to our careful memory optimizations, MsT shows no degradation in throughput or convergence even with sequences 12x longer than those supported by standard implementations. MsT is fully general, implementation-agnostic, and requires minimal code changes to integrate with existing LLM training frameworks.
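To illustrate the core idea, the sketch below applies a per-token block (e.g., an MLP or LM head) to mini-sequences one at a time, so only one chunk's large intermediate activations are live at once. This is a minimal PyTorch sketch, not the paper's implementation; the function name `mini_sequence_forward`, the parameter `num_minis`, and the dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

def mini_sequence_forward(module: nn.Module, x: torch.Tensor, num_minis: int) -> torch.Tensor:
    """Apply `module` along the sequence dimension in chunks.

    Works for sequence-independent blocks (MLP, LM head): splitting the
    (batch, seq, hidden) input into `num_minis` mini-sequences keeps only one
    chunk's intermediate activations alive at a time, reducing peak memory.
    Illustrative code only, not the paper's API.
    """
    chunks = torch.chunk(x, num_minis, dim=1)            # split along the sequence axis
    return torch.cat([module(c) for c in chunks], dim=1)

# Example: a Llama-style MLP whose wide intermediate dimension dominates activation memory.
mlp = nn.Sequential(nn.Linear(4096, 14336), nn.SiLU(), nn.Linear(14336, 4096))
x = torch.randn(1, 8192, 4096)                            # (batch, long sequence, hidden)
y = mini_sequence_forward(mlp, x, num_minis=4)            # peak intermediate memory roughly 1/4
```

In actual training, this chunked forward would be combined with activation recomputation (e.g., gradient checkpointing over each mini-sequence) so the backward pass also avoids holding all intermediate activations, as the abstract describes.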
Cite
Text
Luo et al. "MINI-SEQUENCE TRANSFORMER: Optimizing Intermediate Memory for Long Sequences Training." ICML 2024 Workshops: LCFM, 2024.
Markdown
[Luo et al. "MINI-SEQUENCE TRANSFORMER: Optimizing Intermediate Memory for Long Sequences Training." ICML 2024 Workshops: LCFM, 2024.](https://mlanthology.org/icmlw/2024/luo2024icmlw-minisequence/)
BibTeX
@inproceedings{luo2024icmlw-minisequence,
title = {{MINI-SEQUENCE TRANSFORMER: Optimizing Intermediate Memory for Long Sequences Training}},
author = {Luo, Cheng and Zhao, Jiawei and Chen, Zhuoming and Chen, Beidi and Anandkumar, Anima},
booktitle = {ICML 2024 Workshops: LCFM},
year = {2024},
url = {https://mlanthology.org/icmlw/2024/luo2024icmlw-minisequence/}
}