Llamba: Scaling Distilled Recurrent Models for Efficient Language Processing
Abstract
We present the Llamba model series, a family of highly efficient recurrent language models distilled from the Llama-3.x family into the Mamba architecture. The series includes Llamba-1B, Llamba-4B, and Llamba-8B, which deliver high inference throughput while maintaining competitive benchmark performance. Beyond its computational advantages, Llamba showcases the effectiveness of the MOHAWK distillation framework, achieving high-quality performance while being distilled with less than 0.1% of the data typically used for models of similar size. We also provide an optimized implementation of the Llamba models for deployment on resource-constrained devices, such as smartphones and edge platforms, offering a practical and memory-efficient alternative to traditional Transformer architectures. Overall, these models set new standards for speed, memory efficiency, and accessibility of language models.
Cite
Text
Bick et al. "Llamba: Scaling Distilled Recurrent Models for Efficient Language Processing." ICLR 2025 Workshops: SCOPE, 2025.
Markdown
[Bick et al. "Llamba: Scaling Distilled Recurrent Models for Efficient Language Processing." ICLR 2025 Workshops: SCOPE, 2025.](https://mlanthology.org/iclrw/2025/bick2025iclrw-llamba/)
BibTeX
@inproceedings{bick2025iclrw-llamba,
title = {{Llamba: Scaling Distilled Recurrent Models for Efficient Language Processing}},
author = {Bick, Aviv and Katsch, Tobias and Sohoni, Nimit Sharad and Desai, Arjun D and Gu, Albert},
booktitle = {ICLR 2025 Workshops: SCOPE},
year = {2025},
url = {https://mlanthology.org/iclrw/2025/bick2025iclrw-llamba/}
}