SpeedLimit: Neural Architecture Search for Quantized Transformer Models

Abstract

While research in the field of transformer models has primarily focused on enhancing performance metrics such as accuracy and perplexity, practical applications in industry often necessitate a rigorous consideration of inference latency constraints. Addressing this challenge, we introduce SpeedLimit, a novel Neural Architecture Search (NAS) technique that optimizes accuracy whilst adhering to an upper-bound latency constraint. Our method incorporates 8-bit integer quantization in the search process to outperform the current state-of-the-art technique. Our results underline the feasibility and efficacy of seeking an optimal balance between performance and latency, providing new avenues for deploying state-of-the-art transformer models in latency-sensitive environments.
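To make the idea of latency-constrained search concrete, below is a minimal, purely illustrative sketch of the general pattern the abstract describes: sampling candidate transformer configurations, discarding any whose (quantized) latency estimate exceeds an upper bound, and keeping the most accurate survivor. Everything here is an assumption for illustration only, not the authors' implementation: the search space, the toy latency and accuracy proxies, the assumed int8 speedup factor, and the use of random search in place of the paper's actual NAS procedure.

# Illustrative sketch only (stdlib Python). The search space, the latency and
# accuracy proxies, the INT8_SPEEDUP factor, and the random-search strategy are
# all hypothetical placeholders, not SpeedLimit's actual method.
import random

LATENCY_BUDGET_MS = 50.0   # hypothetical upper-bound latency constraint
INT8_SPEEDUP = 1.8         # assumed speedup from 8-bit integer quantization

SEARCH_SPACE = {
    "num_layers": [4, 6, 8, 12],
    "hidden_size": [256, 384, 512, 768],
    "num_heads": [4, 8, 12],
}

def sample_config():
    # Draw one candidate architecture uniformly from the search space.
    return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

def estimated_latency_ms(cfg, quantized=True):
    # Toy latency proxy: grows with depth and width; int8 divides it by the
    # assumed speedup. A real system would measure latency on target hardware.
    fp_latency = 0.02 * cfg["num_layers"] * cfg["hidden_size"]
    return fp_latency / INT8_SPEEDUP if quantized else fp_latency

def estimated_accuracy(cfg):
    # Toy accuracy proxy: larger models score higher, with a little noise.
    capacity = cfg["num_layers"] * cfg["hidden_size"] * cfg["num_heads"]
    return 0.70 + 0.25 * min(capacity / (12 * 768 * 12), 1.0) + random.gauss(0, 0.01)

def search(num_trials=200):
    # Keep the most accurate candidate whose quantized latency fits the budget.
    best_cfg, best_acc = None, float("-inf")
    for _ in range(num_trials):
        cfg = sample_config()
        if estimated_latency_ms(cfg, quantized=True) > LATENCY_BUDGET_MS:
            continue  # violates the latency constraint; discard
        acc = estimated_accuracy(cfg)
        if acc > best_acc:
            best_cfg, best_acc = cfg, acc
    return best_cfg, best_acc

if __name__ == "__main__":
    cfg, acc = search()
    print(f"best config under {LATENCY_BUDGET_MS} ms: {cfg} (proxy accuracy {acc:.3f})")

The point of the sketch is the constraint handling: quantization enters the search itself (candidates are evaluated at int8 latency, not floating-point latency), so the accuracy-maximizing architecture is chosen from the set that actually meets the deployment budget.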

Cite

Text

Chai et al. "SpeedLimit: Neural Architecture Search for Quantized Transformer Models." ICML 2023 Workshops: ES-FoMO, 2023.

Markdown

[Chai et al. "SpeedLimit: Neural Architecture Search for Quantized Transformer Models." ICML 2023 Workshops: ES-FoMO, 2023.](https://mlanthology.org/icmlw/2023/chai2023icmlw-speedlimit/)

BibTeX

@inproceedings{chai2023icmlw-speedlimit,
  title     = {{SpeedLimit: Neural Architecture Search for Quantized Transformer Models}},
  author    = {Chai, Yuji and Bailey, Luke and Jin, Yunho and Ko, Glenn and Karle, Matthew and Brooks, David and Wei, Gu-Yeon and Kung, H.},
  booktitle = {ICML 2023 Workshops: ES-FoMO},
  year      = {2023},
  url       = {https://mlanthology.org/icmlw/2023/chai2023icmlw-speedlimit/}
}