Data Engineering for Scaling Language Models to 128k Context

ICML 2024 pp. 14125-14134

Abstract

We study a continual pretraining recipe for scaling language models’ context lengths to 128K, with a focus on data engineering. We hypothesize that long-context modeling, in particular the ability to utilize information at arbitrary input locations, is a capability mostly acquired through large-scale pretraining, and that it can be readily extended to contexts substantially longer than those seen during training (e.g., from 4K to 128K) through lightweight continual pretraining on an appropriate data mixture. We investigate both the quantity and quality of the data for continual pretraining: (1) for quantity, we show that 500 million to 5 billion tokens are enough to enable the model to retrieve information anywhere within the 128K context; (2) for quality, our results emphasize domain balance and length upsampling equally. Concretely, naïvely upsampling longer data from certain domains like books, a common practice in existing work, gives suboptimal performance; a balanced domain mixture is equally important. We demonstrate that continual pretraining of the full model on 1B-5B tokens of such data is an effective and affordable strategy for scaling the context length of language models to 128K. Our recipe outperforms strong open-source long-context models and closes the gap to frontier models like GPT-4 128K.
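The core data idea in the abstract — upsample long documents *within* each domain while keeping the overall domain mixture balanced, rather than naïvely upsampling long-document domains like books — can be sketched as a simple reweighting. This is an illustrative sketch, not the paper's actual pipeline; the document schema (`domain`, `num_tokens`), the length threshold, and the boost factor are all hypothetical choices.

```python
from collections import defaultdict

def per_source_weights(docs, long_threshold=4096, boost=4.0):
    """Sampling weights that upsample long documents within each domain
    while preserving each domain's original share of total tokens
    (the 'balanced mixture' constraint).

    docs: list of dicts with 'domain' and 'num_tokens' keys
          (hypothetical schema for illustration).
    """
    # Raw weight: boost documents past the length threshold.
    raw = [boost if d["num_tokens"] >= long_threshold else 1.0 for d in docs]

    # Per-domain token mass before and after the raw boost.
    domain_tokens = defaultdict(float)  # original token mass
    domain_raw = defaultdict(float)     # boosted token mass
    for d, w in zip(docs, raw):
        domain_tokens[d["domain"]] += d["num_tokens"]
        domain_raw[d["domain"]] += w * d["num_tokens"]

    # Rescale each domain so its expected token share is unchanged:
    # long docs are favored only relative to short docs in the same domain.
    scale = {k: domain_tokens[k] / domain_raw[k] for k in domain_tokens}
    return [w * scale[d["domain"]] for d, w in zip(docs, raw)]
```

Under this scheme, a long book is sampled more often than a short book, but the books domain as a whole contributes the same fraction of training tokens as in the original corpus.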

Cite

Text

Fu et al. "Data Engineering for Scaling Language Models to 128k Context." International Conference on Machine Learning, 2024.

Markdown

[Fu et al. "Data Engineering for Scaling Language Models to 128k Context." International Conference on Machine Learning, 2024.](https://mlanthology.org/icml/2024/fu2024icml-data/)

BibTeX

@inproceedings{fu2024icml-data,
  title     = {{Data Engineering for Scaling Language Models to 128k Context}},
  author    = {Fu, Yao and Panda, Rameswar and Niu, Xinyao and Yue, Xiang and Hajishirzi, Hannaneh and Kim, Yoon and Peng, Hao},
  booktitle = {International Conference on Machine Learning},
  year      = {2024},
  pages     = {14125--14134},
  volume    = {235},
  url       = {https://mlanthology.org/icml/2024/fu2024icml-data/}
}