Jamba: Hybrid Transformer-Mamba Language Models

Abstract

We present Jamba, a novel hybrid Transformer-Mamba mixture-of-experts (MoE) architecture. Jamba interleaves blocks of Transformer and Mamba layers, enjoying the benefits of both model families. MoE is added in some of these layers to increase model capacity while keeping active parameter usage manageable. This flexible architecture allows resource- and objective-specific configurations. We implement two configurations: Jamba-1.5-Large, with 94B active parameters, and Jamba-1.5-Mini, with 12B active parameters. Built at large scale, Jamba models provide high throughput and a small memory footprint compared to vanilla Transformers, especially on long-context tasks, with an effective context length of 256K tokens, the largest among open-weight models. At the same time, they are also competitive on standard language modeling and chatbot benchmarks. We study various architectural decisions, such as how to combine Transformer and Mamba layers and how to mix experts, and show that some of them are crucial in large-scale modeling. To support cost-effective inference, we introduce ExpertsInt8, a novel quantization technique that allows fitting Jamba-1.5-Large on a machine with eight 80GB GPUs when processing 256K-token contexts without loss of quality. We also describe several interesting properties of this architecture that the training and evaluation of Jamba have revealed. The model weights are publicly available.
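
The sketch below is a minimal, illustrative rendering of the idea described in the abstract: attention (Transformer) layers and Mamba layers interleaved in a block, with the feed-forward sub-layer replaced by a top-k MoE in some layers. It is not the authors' implementation; the Mamba mixer is a labeled stand-in (a real model would use a selective-SSM block), and the interleaving ratios, expert count, and routing choice are placeholder values chosen for readability.

```python
# Illustrative Jamba-style hybrid block (NOT the authors' code).
# MambaStandIn is a placeholder for a real selective-SSM (Mamba) mixer;
# layer ratios, expert count, and top-k routing below are illustrative only.
import torch
import torch.nn as nn


class MambaStandIn(nn.Module):
    """Placeholder for a Mamba (selective state-space) sequence mixer."""
    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x):                       # x: (batch, seq, d_model)
        return self.proj(x)                     # a real Mamba layer runs a selective scan here


class AttentionMixer(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        out, _ = self.attn(x, x, x, need_weights=False)
        return out


class MoEFFN(nn.Module):
    """Top-k mixture-of-experts MLP: each token is weighted toward k experts."""
    def __init__(self, d_model: int, n_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):
        scores = self.router(x)                           # (batch, seq, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):                       # naive routing loop, for clarity
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e).unsqueeze(-1)   # tokens routed to expert e
                out = out + mask * weights[..., k:k + 1] * expert(x)
        return out


class HybridLayer(nn.Module):
    """Pre-norm residual layer: attention OR Mamba mixer, then dense MLP OR MoE."""
    def __init__(self, d_model: int, use_attention: bool, use_moe: bool):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.mixer = AttentionMixer(d_model) if use_attention else MambaStandIn(d_model)
        self.ffn = MoEFFN(d_model) if use_moe else nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        x = x + self.mixer(self.norm1(x))
        x = x + self.ffn(self.norm2(x))
        return x


def build_jamba_style_block(d_model=512, n_layers=8, attn_every=8, moe_every=2):
    """One attention layer per `attn_every` layers; MoE FFN every `moe_every` layers."""
    return nn.Sequential(*[
        HybridLayer(d_model,
                    use_attention=(i % attn_every == attn_every - 1),
                    use_moe=(i % moe_every == moe_every - 1))
        for i in range(n_layers)
    ])


if __name__ == "__main__":
    block = build_jamba_style_block()
    x = torch.randn(2, 16, 512)
    print(block(x).shape)                     # torch.Size([2, 16, 512])
```

The design point the sketch is meant to convey: Mamba layers dominate the stack (keeping the KV cache small and throughput high at long context), occasional attention layers restore global token mixing, and MoE widens the feed-forward capacity while only the routed experts contribute to the active parameter count per token.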

Cite

Text

Lenz et al. "Jamba: Hybrid Transformer-Mamba Language Models." International Conference on Learning Representations, 2025.

Markdown

[Lenz et al. "Jamba: Hybrid Transformer-Mamba Language Models." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/lenz2025iclr-jamba/)

BibTeX

@inproceedings{lenz2025iclr-jamba,
  title     = {{Jamba: Hybrid Transformer-Mamba Language Models}},
  author    = {Lenz, Barak and Lieber, Opher and Arazi, Alan and Bergman, Amir and Manevich, Avshalom and Peleg, Barak and Aviram, Ben and Almagor, Chen and Fridman, Clara and Padnos, Dan and Gissin, Daniel and Jannai, Daniel and Muhlgay, Dor and Zimberg, Dor and Gerber, Edden M. and Dolev, Elad and Krakovsky, Eran and Safahi, Erez and Schwartz, Erez and Cohen, Gal and Shachaf, Gal and Rozenblum, Haim and Bata, Hofit and Blass, Ido and Magar, Inbal and Dalmedigos, Itay and Osin, Jhonathan and Fadlon, Julie and Rozman, Maria and Danos, Matan and Gokhman, Michael and Zusman, Mor and Gidron, Naama and Ratner, Nir and Gat, Noam and Rozen, Noam and Fried, Oded and Leshno, Ohad and Antverg, Omer and Abend, Omri and Dagan, Or and Cohavi, Orit and Alon, Raz and Belson, Ro'i and Cohen, Roi and Gilad, Rom and Glozman, Roman and Lev, Shahar and Shalev-Shwartz, Shai and Meirom, Shaked Haim and Delbari, Tal and Ness, Tal and Asida, Tomer and Gal, Tom Ben and Braude, Tom and Pumerantz, Uriya and Cohen, Josh and Belinkov, Yonatan and Globerson, Yuval and Levy, Yuval Peleg and Shoham, Yoav},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/lenz2025iclr-jamba/}
}