Enabling On-Device Large Language Models with 3D-Stacked Memory
Abstract
In this paper, we address the growing need for new memory technologies to enable the deployment of on-device large language models (LLMs) on resource-constrained augmented reality (AR) edge devices. We evaluate the memory power and area savings of 3D-stacked memory (3D-DRAM, 3D-SRAM) versus conventional 2D memory (LPDDR-DRAM, SRAM). At target inference rates of 5-100 inferences per second, 3D-DRAM consumes the least memory power of all the memory options, achieving a ∼7-15x improvement in memory power consumption over conventional 2D memory across our benchmark suite of on-device LLMs (Distilled GPT-2, GPT-2, BART Base, and BART Large). While 3D-SRAM can reduce dynamic memory power, its leakage power when storing such large models becomes prohibitively costly, making 3D-DRAM the better option for on-device LLMs. Additionally, because 3D-DRAM reduces the memory power consumption of on-device LLMs to tens of mW, it enables the deployment of much larger LLMs that could not previously be deployed with conventional DRAM and 2D SRAM solutions.
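To make the dynamic-versus-leakage trade-off behind these results concrete, the sketch below implements a first-order memory power model: dynamic power scales with model size and inference rate (assuming weights are streamed once per inference), while leakage power scales with the capacity that must stay powered. All technology parameters, the model size, and the inference rate are illustrative assumptions for this example, not values from the paper.

```python
# First-order memory power model for on-device LLM inference.
# NOTE: all constants below are illustrative assumptions,
# not measured values from the paper.

def memory_power_mw(model_mb, inf_per_s, read_uj_per_mb, leak_uw_per_mb):
    """Total memory power in mW.

    Dynamic term: assumes every weight is read once per inference,
    so energy scales with model size and inference rate.
    Leakage term: scales with the capacity that must stay powered.
    """
    dynamic_mw = model_mb * read_uj_per_mb * inf_per_s / 1000.0  # uJ/s -> mW
    leakage_mw = model_mb * leak_uw_per_mb / 1000.0              # uW  -> mW
    return dynamic_mw + leakage_mw

# Assumed per-technology parameters (read energy, standby/leakage power).
TECHS = {
    "LPDDR-DRAM": dict(read_uj_per_mb=200.0, leak_uw_per_mb=10.0),
    "3D-DRAM":    dict(read_uj_per_mb=20.0,  leak_uw_per_mb=10.0),
    "3D-SRAM":    dict(read_uj_per_mb=8.0,   leak_uw_per_mb=500.0),
}

MODEL_MB = 150  # e.g., a GPT-2-class model at 8-bit weights (assumed)
RATE = 20       # inferences per second, within the 5-100 target range

for name, params in TECHS.items():
    total = memory_power_mw(MODEL_MB, RATE, **params)
    print(f"{name:>10}: {total:6.1f} mW")
```

Under these assumed numbers, 3D-DRAM lands in the tens-of-mW range with roughly a 10x advantage over LPDDR-DRAM, while 3D-SRAM's leakage term dominates its total despite the lowest per-access energy, mirroring the trade-off described in the abstract.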
Cite
@inproceedings{yang2024neuripsw-enabling,
  title     = {{Enabling On-Device Large Language Models with 3D-Stacked Memory}},
  author    = {Yang, Lita and Sreedhar, Kavya and Liu, Huichu and Beigne, Edith},
  booktitle = {NeurIPS 2024 Workshops: MLNCP},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/yang2024neuripsw-enabling/}
}