HyperGrid Transformers: Towards a Single Model for Multiple Tasks
Abstract
Achieving state-of-the-art performance on natural language understanding tasks typically relies on fine-tuning a fresh model for every task. Consequently, this approach leads to a higher overall parameter cost, along with higher technical maintenance for serving multiple models. Learning a single multi-task model that performs well across all tasks is a challenging yet attractive proposition. In this paper, we propose HyperGrid Transformers, a new Transformer architecture that leverages task-conditioned hypernetworks for controlling its feed-forward layers. Specifically, we propose a decomposable hypernetwork that learns grid-wise projections that help to specialize regions in weight matrices for different tasks. In order to construct the proposed hypernetwork, our method learns the interactions and composition between a global (task-agnostic) state and a local task-specific state. We conduct an extensive set of experiments on GLUE/SuperGLUE. On the SuperGLUE test set, we match the performance of the state-of-the-art while being 16 times more parameter efficient. Our method helps bridge the gap between fine-tuning and multi-task learning approaches.
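The abstract describes gating the feed-forward layers with a grid-wise projection composed from a global (task-agnostic) state and a local task-specific state. The following is a minimal, hypothetical sketch of that general idea: a coarse grid of gate values is formed from the two states and tiled over the feed-forward weight. All names, shapes, and the sigmoid-of-outer-product composition are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

# Hypothetical sketch of grid-wise gating for a Transformer feed-forward weight.
# A small (rows x cols) grid of gates is produced by composing a global
# (task-agnostic) vector with a local (task-specific) vector, then tiled up to
# the full weight shape and applied elementwise.

d_model, d_ff = 512, 2048        # feed-forward layer dimensions (assumed)
grid_rows, grid_cols = 8, 8      # coarse grid; each cell controls a weight block

rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(d_ff, d_model))   # shared base FFN weight

global_vec = rng.normal(size=(grid_rows,))         # global, task-agnostic state
local_vec = rng.normal(size=(grid_cols,))          # local, task-specific state

# Compose the two states into a coarse gating grid in [0, 1].
grid_gate = 1.0 / (1.0 + np.exp(-np.outer(global_vec, local_vec)))  # (rows, cols)

# Tile the coarse grid so each gate value specializes a block of the weight.
gate = np.repeat(np.repeat(grid_gate, d_ff // grid_rows, axis=0),
                 d_model // grid_cols, axis=1)      # (d_ff, d_model)

W_task = W * gate                                   # task-specialized weight
x = rng.normal(size=(d_model,))
h = np.maximum(W_task @ x, 0.0)                     # gated feed-forward activation
print(h.shape)                                      # (2048,)
```

Because only the small grid is task-conditioned while the base weight is shared, one model can serve many tasks at a fraction of the parameter cost of per-task fine-tuning, which is the trade-off the abstract highlights.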
Cite
Text
Tay et al. "HyperGrid Transformers: Towards a Single Model for Multiple Tasks." International Conference on Learning Representations, 2021.
Markdown
[Tay et al. "HyperGrid Transformers: Towards a Single Model for Multiple Tasks." International Conference on Learning Representations, 2021.](https://mlanthology.org/iclr/2021/tay2021iclr-hypergrid/)
BibTeX
@inproceedings{tay2021iclr-hypergrid,
  title = {{HyperGrid Transformers: Towards a Single Model for Multiple Tasks}},
  author = {Tay, Yi and Zhao, Zhe and Bahri, Dara and Metzler, Donald and Juan, Da-Cheng},
  booktitle = {International Conference on Learning Representations},
  year = {2021},
  url = {https://mlanthology.org/iclr/2021/tay2021iclr-hypergrid/}
}