Structural Pruning of Pre-Trained Language Models via Neural Architecture Search

Abstract

Pre-trained language models (PLMs), for example BERT or RoBERTa, mark the state of the art for natural language understanding tasks when fine-tuned on labeled data. However, their large size poses challenges in deploying them for inference in real-world applications, due to significant GPU memory requirements and high inference latency. This paper explores neural architecture search (NAS) for structural pruning to find sub-parts of the fine-tuned network that optimally trade off efficiency, for example in terms of model size or latency, against generalization performance. We also show how recently developed two-stage weight-sharing NAS approaches can be utilized in this setting to accelerate the search process. Unlike traditional pruning methods with fixed thresholds, we propose to adopt a multi-objective approach that identifies the Pareto-optimal set of sub-networks, allowing for a more flexible and automated compression process.
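To illustrate the multi-objective idea in the abstract, below is a minimal, hypothetical sketch (not the authors' implementation) of searching over sub-network configurations and keeping only the Pareto-optimal trade-offs between validation error and model size. The evaluate_subnetwork function is a placeholder assumption; in practice it would evaluate a structurally pruned PLM on a validation set.

# Minimal sketch of a multi-objective sub-network search (illustrative only).
# evaluate_subnetwork is a hypothetical stub; a real version would evaluate a
# pruned pre-trained language model and report validation error and size.

import itertools
import random

def evaluate_subnetwork(num_layers: int, num_heads: int) -> tuple[float, int]:
    """Return (validation_error, num_parameters) for a candidate sub-network."""
    params = num_layers * num_heads * 1_000_000          # toy parameter count
    error = 0.10 + 0.5 / (num_layers * num_heads) + random.gauss(0, 0.005)
    return error, params

def pareto_front(candidates):
    """Keep configurations that are not dominated in both error and size."""
    front = []
    for cand in candidates:
        dominated = any(
            other["error"] <= cand["error"] and other["params"] <= cand["params"]
            and (other["error"] < cand["error"] or other["params"] < cand["params"])
            for other in candidates
        )
        if not dominated:
            front.append(cand)
    return sorted(front, key=lambda c: c["params"])

# Search space: how many transformer layers and attention heads to retain.
results = []
for num_layers, num_heads in itertools.product(range(2, 13, 2), range(2, 13, 2)):
    error, params = evaluate_subnetwork(num_layers, num_heads)
    results.append({"layers": num_layers, "heads": num_heads,
                    "error": error, "params": params})

for cand in pareto_front(results):
    print(cand)

Rather than returning a single pruned model for a fixed threshold, the search surfaces the whole Pareto front, from which a practitioner can pick the sub-network matching their memory or latency budget.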

Cite

Text

Klein et al. "Structural Pruning of Pre-Trained Language Models via Neural Architecture Search." Transactions on Machine Learning Research, 2024.

Markdown

[Klein et al. "Structural Pruning of Pre-Trained Language Models via Neural Architecture Search." Transactions on Machine Learning Research, 2024.](https://mlanthology.org/tmlr/2024/klein2024tmlr-structural/)

BibTeX

@article{klein2024tmlr-structural,
  title     = {{Structural Pruning of Pre-Trained Language Models via Neural Architecture Search}},
  author    = {Klein, Aaron and Golebiowski, Jacek and Ma, Xingchen and Perrone, Valerio and Archambeau, Cedric},
  journal   = {Transactions on Machine Learning Research},
  year      = {2024},
  url       = {https://mlanthology.org/tmlr/2024/klein2024tmlr-structural/}
}