Towards Smaller Language Models via Layer Looping

Abstract

Language models store a huge amount of knowledge in their parameters. This dominant architecture bears little resemblance to the implementations of optimized data stores (e.g., a database management system like PostgreSQL), which raises the question: are there other architectures that can store and query the same information more efficiently? In this work, we explore two simple modifications to the standard architecture: looping (sharing parameters across layers) and mixture-of-experts (MoE). We compare the space complexity of standard and looped-MoE models on a simple task in which the model must memorize a knowledge graph (KG) and answer multi-hop queries over it. We prove that the looped-MoE model can store a KG of size $T$ and answer $q$-hop queries with $\mathcal{O}(T)$ parameters. In contrast, the best known upper bound for the standard model is $\mathcal{O}(qT)$ parameters. We confirm this scaling with experiments on synthetic KGs, finding that looped-conditional models can reliably answer four-hop queries over KGs $9\times$ larger than those that parameter-matched standard models can handle.
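
To make the two modifications concrete, below is a minimal PyTorch sketch of a block whose weights are shared (looped) across depth and whose feed-forward is a top-1 mixture-of-experts. This is an illustration of the general idea under assumed hyperparameters (`n_loops`, `n_experts`, `d_model`) and simple hard routing, not the paper's exact looped-MoE architecture; all class and parameter names here are hypothetical.

```python
# Minimal sketch (assumptions, not the paper's exact model): one attention + MoE
# block applied repeatedly with shared weights, so depth can grow with the
# number of hops q while the parameter count stays fixed.
import torch
import torch.nn as nn


class MoEMLP(nn.Module):
    """Feed-forward layer with top-1 (hard) expert routing, for illustration."""

    def __init__(self, d_model, n_experts, d_hidden):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):
        # Route each token to the expert with the highest router score.
        idx = self.router(x).argmax(dim=-1)            # (batch, seq)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                out[mask] = expert(x[mask])
        return out


class LoopedMoEBlock(nn.Module):
    """A single block whose parameters are reused `n_loops` times (looping)."""

    def __init__(self, d_model=256, n_heads=4, n_experts=8, n_loops=4):
        super().__init__()
        self.n_loops = n_loops
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.moe = MoEMLP(d_model, n_experts, 4 * d_model)
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Looping: the same attention and MoE weights are applied at every
        # "layer" of the unrolled computation.
        for _ in range(self.n_loops):
            h = self.ln1(x)
            a, _ = self.attn(h, h, h, need_weights=False)
            x = x + a
            x = x + self.moe(self.ln2(x))
        return x


x = torch.randn(2, 16, 256)     # (batch, seq, d_model)
y = LoopedMoEBlock()(x)
print(y.shape)                  # torch.Size([2, 16, 256])
```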

Cite

Text

Eyuboglu et al. "Towards Smaller Language Models via Layer Looping." ICML 2024 Workshops: ES-FoMo-II, 2024.

Markdown

[Eyuboglu et al. "Towards Smaller Language Models via Layer Looping." ICML 2024 Workshops: ES-FoMo-II, 2024.](https://mlanthology.org/icmlw/2024/eyuboglu2024icmlw-smaller/)

BibTeX

@inproceedings{eyuboglu2024icmlw-smaller,
  title     = {{Towards Smaller Language Models via Layer Looping}},
  author    = {Eyuboglu, Sabri and Zinsley, Dylan and Saad-Falcon, Jon and Arora, Simran and Rudra, Atri and Zou, James and R{\'e}, Christopher},
  booktitle = {ICML 2024 Workshops: ES-FoMo-II},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/eyuboglu2024icmlw-smaller/}
}