ShareBERT: Embeddings Are Capable of Learning Hidden Layers

Abstract

The deployment of pre-trained language models on memory-limited devices is hindered by their massive number of parameters, which has motivated interest in developing smaller architectures. Prior work in the model compression literature has shown that small models often suffer noticeable performance degradation and need to be paired with transfer learning methods, such as Knowledge Distillation. In this work, we propose a parameter-sharing method that shares parameters between the embeddings and the hidden layers, enabling the design of near-zero parameter encoders. To demonstrate its effectiveness, we present an architecture design called ShareBERT, which preserves up to 95.5% of BERT Base performance using only 5M parameters (21.9× fewer parameters) without the help of Knowledge Distillation. We demonstrate empirically that our proposal does not negatively affect the model's learning capabilities and that it is even beneficial for representation learning. Code will be available at https://github.com/jchenghu/sharebert.
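
As a rough sanity check on the figures quoted in the abstract, the short sketch below back-computes the baseline size implied by "5M parameters" and "21.9× fewer parameters". The function name `implied_baseline_params` is hypothetical, and the 110M reference value is the commonly cited approximate size of BERT Base, not a number taken from this abstract; since 5M is itself a rounded figure, the result is only indicative.

```python
def implied_baseline_params(compressed_params: float, reduction_ratio: float) -> float:
    """Return the baseline parameter count implied by a compressed size and a reduction ratio."""
    return compressed_params * reduction_ratio


# Figures quoted in the abstract (5M is a rounded value).
sharebert_params = 5_000_000
reduction_ratio = 21.9

# Assumption: BERT Base is commonly reported as having roughly 110M parameters.
bert_base_params_approx = 110_000_000

baseline = implied_baseline_params(sharebert_params, reduction_ratio)
relative_gap = abs(baseline - bert_base_params_approx) / bert_base_params_approx

print(f"Implied baseline size: {baseline / 1e6:.1f}M parameters")
print(f"Relative gap to ~110M: {relative_gap:.2%}")
```

The implied baseline lands within about half a percent of the commonly cited BERT Base size, so the two quoted numbers are mutually consistent.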

Cite

Text

Hu et al. "ShareBERT: Embeddings Are Capable of Learning Hidden Layers." AAAI Conference on Artificial Intelligence, 2024. doi:10.1609/AAAI.V38I16.29781

Markdown

[Hu et al. "ShareBERT: Embeddings Are Capable of Learning Hidden Layers." AAAI Conference on Artificial Intelligence, 2024.](https://mlanthology.org/aaai/2024/hu2024aaai-sharebert/) doi:10.1609/AAAI.V38I16.29781

BibTeX

@inproceedings{hu2024aaai-sharebert,
  title     = {{ShareBERT: Embeddings Are Capable of Learning Hidden Layers}},
  author    = {Hu, Jia-Cheng and Cavicchioli, Roberto and Berardinelli, Giulia and Capotondi, Alessandro},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2024},
  pages     = {18225--18233},
  doi       = {10.1609/AAAI.V38I16.29781},
  url       = {https://mlanthology.org/aaai/2024/hu2024aaai-sharebert/}
}