BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning

Abstract

Multi-task learning shares information between related tasks, sometimes reducing the number of parameters required. State-of-the-art results across multiple natural language understanding tasks in the GLUE benchmark have previously used transfer from a single large task: unsupervised pre-training with BERT, where a separate BERT model was fine-tuned for each task. We explore multi-task approaches that share a single BERT model with a small number of additional task-specific parameters. Using new adaptation modules, PALs or ‘projected attention layers’, we match the performance of separately fine-tuned models on the GLUE benchmark with $\approx$7 times fewer parameters, and obtain state-of-the-art results on the Recognizing Textual Entailment dataset.
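The sketch below (not the authors' released code) illustrates the idea behind a projected attention layer as described in the abstract: a small multi-head attention block operating in a projected low-dimensional space, added alongside a shared BERT layer so that only a few task-specific parameters are introduced per task. The dimensions (hidden size 768, projected size 204, 12 heads) and the residual combination in the usage comment are assumptions for illustration, not values taken from this page.

```python
import torch
import torch.nn as nn


class ProjectedAttentionLayer(nn.Module):
    """Task-specific low-rank attention added in parallel to a shared BERT layer (illustrative sketch)."""

    def __init__(self, hidden_size=768, small_size=204, num_heads=12):
        super().__init__()
        self.encode = nn.Linear(hidden_size, small_size, bias=False)  # project down to the small space
        self.attn = nn.MultiheadAttention(small_size, num_heads, batch_first=True)
        self.decode = nn.Linear(small_size, hidden_size, bias=False)  # project back up to hidden size

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, hidden_size)
        low = self.encode(hidden_states)
        attended, _ = self.attn(low, low, low, need_weights=False)
        return self.decode(attended)


# Assumed usage: the PAL output is added to the shared layer's output, e.g.
# layer_out = layer_norm(shared_self_attention(h) + pal(h) + h),
# and only the PAL (plus task heads) is trained per task.
pal = ProjectedAttentionLayer()
h = torch.randn(2, 16, 768)
print(pal(h).shape)  # torch.Size([2, 16, 768])
```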

Cite

Text

Stickland and Murray. "BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning." International Conference on Machine Learning, 2019.

Markdown

[Stickland and Murray. "BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning." International Conference on Machine Learning, 2019.](https://mlanthology.org/icml/2019/stickland2019icml-bert/)

BibTeX

@inproceedings{stickland2019icml-bert,
  title     = {{BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning}},
  author    = {Stickland, Asa Cooper and Murray, Iain},
  booktitle = {International Conference on Machine Learning},
  year      = {2019},
  pages     = {5986--5995},
  volume    = {97},
  url       = {https://mlanthology.org/icml/2019/stickland2019icml-bert/}
}