MEGABYTE: Predicting Million-Byte Sequences with Multiscale Transformers

Abstract

Autoregressive transformers are spectacular models for short sequences but scale poorly to long sequences such as high-resolution images, podcasts, code, or books. We propose Megabyte, a multi-scale decoder architecture that enables end-to-end differentiable modeling of sequences of over one million bytes. Megabyte segments sequences into patches and uses a local submodel within patches and a global model between patches. This enables sub-quadratic self-attention, much larger feedforward layers for the same compute, and improved parallelism during decoding, unlocking better performance at reduced cost for both training and generation. Extensive experiments show that Megabyte allows byte-level models to perform competitively with subword models on long-context language modeling, achieve state-of-the-art density estimation on ImageNet, and model audio from raw files. Together, these results establish the viability of tokenization-free autoregressive sequence modeling at scale.
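
The sketch below illustrates, in PyTorch, the patch-based global/local decoding idea summarized above: bytes are grouped into fixed-size patches, a global transformer models the (much shorter) sequence of patch embeddings, and a local transformer predicts the bytes inside each patch in parallel. This is not the authors' released implementation; the class name MegabyteSketch, the patch size, the model dimensions, and the use of nn.TransformerEncoder with causal masks are illustrative assumptions.

import torch
import torch.nn as nn


class MegabyteSketch(nn.Module):
    """Illustrative patch-based global/local byte decoder (not the paper's code)."""

    def __init__(self, vocab=256, patch=8, d_global=512, d_local=128, layers=2):
        super().__init__()
        self.patch = patch
        self.byte_emb = nn.Embedding(vocab, d_local)
        # A patch is embedded by concatenating the embeddings of its bytes.
        self.to_global = nn.Linear(patch * d_local, d_global)
        self.from_global = nn.Linear(d_global, patch * d_local)
        self.global_model = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_global, nhead=8, batch_first=True), layers)
        self.local_model = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_local, nhead=8, batch_first=True), layers)
        self.global_start = nn.Parameter(torch.zeros(1, 1, d_global))
        self.local_start = nn.Parameter(torch.zeros(1, 1, 1, d_local))
        self.head = nn.Linear(d_local, vocab)

    def forward(self, x):  # x: (B, T) byte ids, with T divisible by the patch size
        B, T = x.shape
        K, P = T // self.patch, self.patch
        e = self.byte_emb(x).view(B, K, P, -1)

        # Global stream: shift the patch sequence right by one so the global
        # state for patch k attends only to patches 0..k-1; attention is over
        # K = T / P positions rather than T, hence the sub-quadratic cost.
        g_in = self.to_global(e.reshape(B, K, -1))
        g_in = torch.cat([self.global_start.expand(B, -1, -1), g_in[:, :-1]], dim=1)
        g_mask = nn.Transformer.generate_square_subsequent_mask(K).to(x.device)
        g_out = self.global_model(g_in, mask=g_mask)

        # Local stream: bytes are shifted right by one within each patch, added
        # to that patch's global state, and all patches are decoded in parallel.
        l_byte = torch.cat([self.local_start.expand(B, K, -1, -1), e[:, :, :-1]], dim=2)
        l_in = (l_byte + self.from_global(g_out).view(B, K, P, -1)).reshape(B * K, P, -1)
        l_mask = nn.Transformer.generate_square_subsequent_mask(P).to(x.device)
        l_out = self.local_model(l_in, mask=l_mask)
        return self.head(l_out).view(B, T, -1)  # next-byte logits for every position


# Usage: next-byte logits for a batch of 1,024-byte sequences.
model = MegabyteSketch()
logits = model(torch.randint(0, 256, (2, 1024)))  # -> shape (2, 1024, 256)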

Cite

Text

Yu et al. "MEGABYTE: Predicting Million-Byte Sequences with Multiscale Transformers." Neural Information Processing Systems, 2023.

Markdown

[Yu et al. "MEGABYTE: Predicting Million-Byte Sequences with Multiscale Transformers." Neural Information Processing Systems, 2023.](https://mlanthology.org/neurips/2023/yu2023neurips-megabyte/)

BibTeX

@inproceedings{yu2023neurips-megabyte,
  title     = {{MEGABYTE: Predicting Million-Byte Sequences with Multiscale Transformers}},
  author    = {Yu, Lili and Simig, Daniel and Flaherty, Colin and Aghajanyan, Armen and Zettlemoyer, Luke and Lewis, Mike},
  booktitle = {Neural Information Processing Systems},
  year      = {2023},
  url       = {https://mlanthology.org/neurips/2023/yu2023neurips-megabyte/}
}