Language Modelling via Learning to Rank
Abstract
We consider language modelling (LM) as a multi-label structured prediction task by re-framing training from solely predicting a single ground-truth word to ranking a set of words which could continue a given context. To avoid annotating top-k ranks, we generate them using pre-trained LMs: GPT-2, BERT, and Born-Again models. This leads to a rank-based form of knowledge distillation (KD). We also develop a method using N-grams to create a non-probabilistic teacher which generates the ranks without the need for a pre-trained LM. We confirm both hypotheses: that LMing can be treated as a ranking task and that this can be done without a pre-trained LM. We show that rank-based KD generally gives a modest improvement to perplexity (PPL) -- though often with statistical significance -- when compared to Kullback–Leibler-based KD. Surprisingly, given the naivety of the method, the N-grams act as competitive teachers and achieve performance similar to using either BERT or Born-Again model teachers. Unsurprisingly, GPT-2 always acts as the best teacher. Using it with a Transformer-XL student on Wiki-02, rank-based KD reduces a cross-entropy baseline from 65.27 to 55.94, versus 56.70 for KL-based KD.
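To make the idea of rank-based KD concrete, the sketch below shows one plausible way to train a student LM on a teacher's top-k ranks instead of its full distribution. The function name, the choice of k, and the use of a Plackett–Luce (ListMLE-style) listwise loss restricted to the teacher's top-k candidates are illustrative assumptions, not necessarily the exact formulation used in the paper.

```python
# Hypothetical sketch of rank-based knowledge distillation for language modelling.
# Assumes PyTorch; the loss form (Plackett-Luce over the teacher's top-k) is an
# illustrative choice, not the paper's exact objective.
import torch


def rank_kd_loss(student_logits: torch.Tensor,
                 teacher_logits: torch.Tensor,
                 k: int = 8) -> torch.Tensor:
    """Listwise loss over the teacher's top-k ranked continuations.

    student_logits, teacher_logits: (batch, vocab) next-token scores.
    The student only receives an ordered list of k candidate words from the
    teacher (the "ranks"), not a full probability distribution.
    """
    # Teacher's top-k candidate words, best first.
    topk = teacher_logits.topk(k, dim=-1).indices          # (batch, k)
    scores = student_logits.gather(-1, topk)                # student scores in teacher order

    # Plackett-Luce log-likelihood of that ordering:
    # sum_i [ s_i - logsumexp(s_i, ..., s_k) ]
    rev_cumlse = torch.flip(
        torch.logcumsumexp(torch.flip(scores, dims=[-1]), dim=-1), dims=[-1]
    )
    return -(scores - rev_cumlse).sum(dim=-1).mean()


# Usage (hypothetical): combine with, or substitute for, the usual
# cross-entropy against the single gold token.
# loss = ce_loss + rank_kd_loss(student_logits, teacher_logits, k=8)
```

A non-probabilistic N-gram teacher fits the same interface: it only needs to emit an ordered candidate list per context, which is why no pre-trained LM is strictly required to produce the ranks.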
Cite
Text
Frydenlund et al. "Language Modelling via Learning to Rank." AAAI Conference on Artificial Intelligence, 2022. doi:10.1609/AAAI.V36I10.21308

Markdown
[Frydenlund et al. "Language Modelling via Learning to Rank." AAAI Conference on Artificial Intelligence, 2022.](https://mlanthology.org/aaai/2022/frydenlund2022aaai-language/) doi:10.1609/AAAI.V36I10.21308

BibTeX
@inproceedings{frydenlund2022aaai-language,
title = {{Language Modelling via Learning to Rank}},
author = {Frydenlund, Arvid and Singh, Gagandeep and Rudzicz, Frank},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2022},
pages = {10636-10644},
doi = {10.1609/AAAI.V36I10.21308},
url = {https://mlanthology.org/aaai/2022/frydenlund2022aaai-language/}
}