Blessing of Class Diversity in Pre-Training

Abstract

This paper presents a new statistical analysis that aims to explain the recent superior performance of pre-training techniques in natural language processing (NLP). We prove that when the classes of the pre-training task (e.g., different words in the masked language model task) are sufficiently diverse, in the sense that the least singular value of the last linear layer in pre-training (denoted as $\tilde{\nu}$) is large, pre-training can significantly improve the sample efficiency of downstream tasks. Specifically, we show that the transfer learning excess risk enjoys an $O\left(\frac{1}{\tilde{\nu} \sqrt{n}}\right)$ rate, in contrast to the $O\left(\frac{1}{\sqrt{m}}\right)$ rate of standard supervised learning. Here, $n$ is the number of pre-training samples and $m$ is the number of downstream-task samples, and typically $n \gg m$. Our proof relies on a vector-form Rademacher complexity chain rule for disassembling composite function classes and a modified self-concordance condition; these techniques may be of independent interest.
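To make the rate comparison concrete, here is a minimal sketch (not from the paper; the matrix sizes, the Gaussian weight matrix W, and the scaling are illustrative assumptions). It computes $\tilde{\nu}$ as the least singular value of a last-layer weight matrix and evaluates the two excess-risk rates with constants dropped.

import numpy as np

# Hypothetical setup: the pre-training head is a linear layer with weight
# matrix W of shape (num_classes, feature_dim). The diversity measure
# nu_tilde from the abstract is the least singular value of this layer.
rng = np.random.default_rng(0)
num_classes, feature_dim = 5000, 768      # illustrative sizes, not from the paper
W = rng.normal(size=(num_classes, feature_dim)) / np.sqrt(feature_dim)

# Least singular value of the last linear layer.
nu_tilde = np.linalg.svd(W, compute_uv=False).min()

# Compare the two rates (ignoring constants and lower-order terms):
n = 10_000_000   # number of pre-training samples (illustrative)
m = 1_000        # number of downstream-task samples (illustrative)
transfer_rate = 1.0 / (nu_tilde * np.sqrt(n))   # O(1 / (nu_tilde * sqrt(n)))
supervised_rate = 1.0 / np.sqrt(m)              # O(1 / sqrt(m))

print(f"nu_tilde        = {nu_tilde:.3f}")
print(f"transfer rate   ~ {transfer_rate:.2e}")
print(f"supervised rate ~ {supervised_rate:.2e}")

Under these illustrative numbers the transfer bound is orders of magnitude smaller than the supervised one, reflecting that $n \gg m$ and that a well-conditioned (diverse) last layer keeps $\tilde{\nu}$ bounded away from zero.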

Cite

Text

Zhao et al. "Blessing of Class Diversity in Pre-Training." Artificial Intelligence and Statistics, 2023.

Markdown

[Zhao et al. "Blessing of Class Diversity in Pre-Training." Artificial Intelligence and Statistics, 2023.](https://mlanthology.org/aistats/2023/zhao2023aistats-blessing/)

BibTeX

@inproceedings{zhao2023aistats-blessing,
  title     = {{Blessing of Class Diversity in Pre-Training}},
  author    = {Zhao, Yulai and Chen, Jianshu and Du, Simon},
  booktitle = {Artificial Intelligence and Statistics},
  year      = {2023},
  pages     = {283--305},
  volume    = {206},
  url       = {https://mlanthology.org/aistats/2023/zhao2023aistats-blessing/}
}