Improving Language Model Distillation Through Hidden State Matching
Abstract
Hidden State Matching is shown to improve knowledge distillation of language models by encouraging similarity between a student and its teacher's hidden states, as demonstrated by DistilBERT and its successors. This typically uses a cosine loss, which restricts the dimensionality of the student to that of the teacher, severely limiting the compression ratio. We present an alternative technique using Centered Kernel Alignment (CKA) to match hidden states of different dimensionality, allowing for smaller students and higher compression ratios. We show the efficacy of our method using encoder-decoder (BART, mBART & T5) and encoder-only (BERT) architectures across a range of tasks from classification to summarization and translation. Our technique is competitive with the current state-of-the-art distillation methods at comparable compression rates. It requires no pretrained student models, but rather can synthesize new student models from scratch through pretraining distillation. It can scale to students smaller than the current methods, is no slower in training and inference, and is considerably more flexible.
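The key idea in the abstract, replacing the dimension-locked cosine loss with CKA so a narrower student can still be matched against its teacher's hidden states, can be illustrated with the standard linear CKA formula of Kornblith et al. The sketch below is a minimal illustration only, not the paper's exact objective: the function names (`linear_cka`, `cka_distillation_loss`), the `1 - CKA` loss form, and the flattening of `(batch, seq, dim)` activations are assumptions made for this example.

```python
import torch

def linear_cka(X: torch.Tensor, Y: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Linear CKA between feature matrices of shape (n, d1) and (n, d2).

    Only the number of rows (examples/tokens) must match; the feature
    dimensions d1 and d2 may differ, which is what allows a student
    narrower than its teacher.
    """
    # Centre each feature column.
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)

    # CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = (Y.T @ X).norm(p="fro") ** 2
    norm_x = (X.T @ X).norm(p="fro")
    norm_y = (Y.T @ Y).norm(p="fro")
    return cross / (norm_x * norm_y + eps)

def cka_distillation_loss(student_hidden: torch.Tensor,
                          teacher_hidden: torch.Tensor) -> torch.Tensor:
    """Hidden-state matching loss as 1 - CKA (an assumed loss form),
    flattening (batch, seq, dim) activations into (n, dim) matrices."""
    s = student_hidden.reshape(-1, student_hidden.size(-1))
    t = teacher_hidden.reshape(-1, teacher_hidden.size(-1))
    return 1.0 - linear_cka(s, t)

# Example: a 384-d student layer matched against a 768-d teacher layer.
teacher_h = torch.randn(8, 128, 768)   # (batch, seq_len, teacher_dim)
student_h = torch.randn(8, 128, 384)   # (batch, seq_len, student_dim)
loss = cka_distillation_loss(student_h, teacher_h)
```

Because CKA compares similarity structure over examples rather than raw vectors, the student and teacher hidden sizes need not agree, which is exactly the restriction the cosine loss imposes in DistilBERT-style distillation.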
Cite
Text
Dasgupta and Cohn. "Improving Language Model Distillation Through Hidden State Matching." International Conference on Learning Representations, 2025.
Markdown
[Dasgupta and Cohn. "Improving Language Model Distillation Through Hidden State Matching." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/dasgupta2025iclr-improving/)
BibTeX
@inproceedings{dasgupta2025iclr-improving,
  title = {{Improving Language Model Distillation Through Hidden State Matching}},
  author = {Dasgupta, Sayantan and Cohn, Trevor},
  booktitle = {International Conference on Learning Representations},
  year = {2025},
  url = {https://mlanthology.org/iclr/2025/dasgupta2025iclr-improving/}
}