SuperShaper: A Pre-Training Approach for Discovering Efficient Transformer Shapes
Abstract
Task-agnostic pre-training followed by task-specific fine-tuning is a default approach to train NLU models that need to be deployed on devices with varying resource and accuracy constraints. However, repeating pre-training and fine-tuning across tens of devices is prohibitively expensive. To address this, we propose SuperShaper, a task-agnostic approach wherein we pre-train a single model that subsumes a large number of Transformer models via linear bottleneck matrices around each Transformer layer, which are sliced to generate differently shaped sub-networks. Despite its simplicity, SuperShaper radically simplifies neural architecture search (NAS) for language models and discovers networks, via an evolutionary algorithm, that effectively trade off accuracy and model size. Discovered networks are more accurate than a range of hand-crafted and automatically searched networks on GLUE benchmarks. Further, a critical advantage of shape as a design variable for NAS is that heuristics for good shapes can be derived, and networks found with these heuristics match and even improve on carefully searched networks across a range of parameter counts.
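The slicing idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the dimensions, the names `w_in`, `w_out`, `layer_stub`, `sliced_forward`, and `sliced_param_count` are all hypothetical, and the Transformer layer body is replaced by a simple nonlinearity. It only shows how one pair of full-size bottleneck matrices could be sliced to yield sub-networks of different widths that share the same weights.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN = 8      # width shared across layers (assumed value)
MAX_INNER = 6   # largest per-layer width the super-network supports (assumed value)

# Full-size bottleneck matrices for one layer of the super-network.
w_in = rng.standard_normal((HIDDEN, MAX_INNER))   # projects into the layer
w_out = rng.standard_normal((MAX_INNER, HIDDEN))  # projects back out of the layer


def layer_stub(x):
    """Hypothetical stand-in for the Transformer layer body."""
    return np.tanh(x)


def sliced_forward(x, inner_dim):
    """Run one layer at width `inner_dim` by slicing the shared bottlenecks."""
    if not 1 <= inner_dim <= MAX_INNER:
        raise ValueError("inner_dim must be between 1 and MAX_INNER")
    down = w_in[:, :inner_dim]   # shape (HIDDEN, inner_dim)
    up = w_out[:inner_dim, :]    # shape (inner_dim, HIDDEN)
    return layer_stub(x @ down) @ up


def sliced_param_count(inner_dim):
    """Number of bottleneck parameters used by a sub-network of this width."""
    return 2 * HIDDEN * inner_dim


tokens = rng.standard_normal((4, HIDDEN))          # 4 token vectors
small = sliced_forward(tokens, 2)                  # narrow sub-network
large = sliced_forward(tokens, MAX_INNER)          # full-width sub-network
```

Both calls return outputs of the same shape, so layers of different widths can be stacked freely; a search procedure such as an evolutionary algorithm would then choose one `inner_dim` per layer under a parameter budget.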
Cite
Text
Ganesan et al. "SuperShaper: A Pre-Training Approach for Discovering Efficient Transformer Shapes." ICML 2023 Workshops: ES-FoMO, 2023.
Markdown
[Ganesan et al. "SuperShaper: A Pre-Training Approach for Discovering Efficient Transformer Shapes." ICML 2023 Workshops: ES-FoMO, 2023.](https://mlanthology.org/icmlw/2023/ganesan2023icmlw-supershaper/)
BibTeX
@inproceedings{ganesan2023icmlw-supershaper,
title = {{SuperShaper: A Pre-Training Approach for Discovering Efficient Transformer Shapes}},
author = {Ganesan, Vinod and Ramesh, Gowtham and Kumar, Pratyush and Dabre, Raj},
booktitle = {ICML 2023 Workshops: ES-FoMO},
year = {2023},
url = {https://mlanthology.org/icmlw/2023/ganesan2023icmlw-supershaper/}
}