SuperShaper: A Pre-Training Approach for Discovering Efficient Transformer Shapes

Abstract

Task-agnostic pre-training followed by task-specific fine-tuning is the default approach to training NLU models, which must then be deployed on devices with varying resource and accuracy constraints. However, repeating pre-training and fine-tuning across tens of devices is prohibitively expensive. To address this, we propose SuperShaper, a task-agnostic approach in which we pre-train a single model that subsumes a large number of Transformer models: linear bottleneck matrices around each Transformer layer are sliced to generate differently shaped sub-networks. Despite its simplicity, SuperShaper radically simplifies NAS for language models and, via an evolutionary algorithm, discovers networks that effectively trade off accuracy and model size. The discovered networks are more accurate than a range of hand-crafted and automatically searched networks on GLUE benchmarks. Further, a critical advantage of shape as a design variable for NAS is that networks found with simple heuristics for good shapes match and even improve on carefully searched networks across a range of parameter counts.
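
The core mechanism in the abstract, slicing linear bottleneck matrices around each layer to draw differently shaped sub-networks from one set of super-network weights, can be illustrated with a minimal PyTorch sketch. All class and variable names below are illustrative assumptions (the inner Transformer block is replaced by a placeholder nonlinearity), not the authors' released implementation.

import random
import torch
import torch.nn as nn
import torch.nn.functional as F


class SlicedBottleneck(nn.Module):
    """Down-/up-projection pair whose inner width can be sliced at runtime."""

    def __init__(self, backbone_dim: int, max_hidden: int):
        super().__init__()
        self.down = nn.Linear(backbone_dim, max_hidden)
        self.up = nn.Linear(max_hidden, backbone_dim)

    def forward(self, x: torch.Tensor, hidden_dim: int) -> torch.Tensor:
        # Slice the first `hidden_dim` rows/columns of the full matrices, so a
        # single set of super-network weights serves every sampled shape.
        h = F.linear(x, self.down.weight[:hidden_dim], self.down.bias[:hidden_dim])
        # Placeholder for the per-layer computation at width `hidden_dim`;
        # the method described in the paper wraps a full Transformer layer here.
        h = F.gelu(h)
        return F.linear(h, self.up.weight[:, :hidden_dim], self.up.bias)


# Sample a "shape" (one hidden dimension per layer) and run a forward pass.
layers = nn.ModuleList(SlicedBottleneck(768, 768) for _ in range(12))
shape = [random.choice([120, 240, 360, 480, 600, 768]) for _ in layers]
x = torch.randn(2, 16, 768)  # (batch, sequence, backbone width)
for layer, hidden in zip(layers, shape):
    x = layer(x, hidden)
print("sampled shape:", shape, "output:", tuple(x.shape))

In this sketch, a search procedure (an evolutionary algorithm in the paper) would propose candidate shapes such as the `shape` list above and evaluate them against accuracy and parameter-count objectives, without any additional pre-training per candidate.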

Cite

Text

Ganesan et al. "SuperShaper: A Pre-Training Approach for Discovering Efficient Transformer Shapes." ICML 2023 Workshops: ES-FoMO, 2023.

Markdown

[Ganesan et al. "SuperShaper: A Pre-Training Approach for Discovering Efficient Transformer Shapes." ICML 2023 Workshops: ES-FoMO, 2023.](https://mlanthology.org/icmlw/2023/ganesan2023icmlw-supershaper/)

BibTeX

@inproceedings{ganesan2023icmlw-supershaper,
  title     = {{SuperShaper: A Pre-Training Approach for Discovering Efficient Transformer Shapes}},
  author    = {Ganesan, Vinod and Ramesh, Gowtham and Kumar, Pratyush and Dabre, Raj},
  booktitle = {ICML 2023 Workshops: ES-FoMO},
  year      = {2023},
  url       = {https://mlanthology.org/icmlw/2023/ganesan2023icmlw-supershaper/}
}