TinyAgent: Quantization-Aware Model Compression and Adaptation for On-Device LLM Agent Deployment
Abstract
Deploying LLMs on edge devices is challenging due to stringent memory and compute constraints. Existing deployment solutions for LLM agents in edge applications disaggregate the fine-tuning process for domain-specific adaptation from the post-training model compression process. As a result, extensive experimentation is required to find a readily available model compression technique that minimizes a fine-tuned model's performance loss while satisfying the target hardware's memory constraints. To address this problem, we propose TinyAgent, which streamlines the deployment workflow by applying a quantization-aware model compression technique to specialized decision-making LLM agents in resource-constrained environments. Our approach accounts for both deployment-time hardware constraints and the challenges of post-training quantization during fine-tuning. Experimental results demonstrate that our approach not only reduces memory usage by $8\times$, making LLM inference possible across a variety of edge devices, but also consistently speeds up LLM inference by up to $4.5\times$ without compromising accuracy.
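The abstract does not spell out implementation details, so as a rough illustration of what "quantization-aware" fine-tuning generally involves, the sketch below fake-quantizes the weights of a linear layer in the forward pass while letting gradients flow to the full-precision weights via the straight-through estimator. The class name, 4-bit width, and symmetric per-tensor scaling are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FakeQuantLinear(nn.Linear):
    """Linear layer whose weights are quantize-dequantized (int4 by default)
    in the forward pass, so fine-tuning adapts to the quantization error.
    Hypothetical sketch, not the TinyAgent implementation."""

    def __init__(self, in_features, out_features, bias=True, n_bits=4):
        super().__init__(in_features, out_features, bias)
        self.n_bits = n_bits

    def forward(self, x):
        qmax = 2 ** (self.n_bits - 1) - 1           # e.g. 7 for int4
        scale = self.weight.abs().max().clamp(min=1e-8) / qmax
        # Round to the low-bit grid, then map back to float.
        w_q = torch.round(self.weight / scale).clamp(-qmax - 1, qmax) * scale
        # Straight-through estimator: gradients bypass the rounding.
        w = self.weight + (w_q - self.weight).detach()
        return F.linear(x, w, self.bias)
```

Swapping a pretrained model's linear layers for a module like this before fine-tuning lets the optimizer compensate for rounding error during adaptation, so the weights can later be stored in true low-bit form at deployment time.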
Cite
Text
Kong et al. "TinyAgent: Quantization-Aware Model Compression and Adaptation for On-Device LLM Agent Deployment." ICML 2024 Workshops: ES-FoMo-II, 2024.
Markdown
[Kong et al. "TinyAgent: Quantization-Aware Model Compression and Adaptation for On-Device LLM Agent Deployment." ICML 2024 Workshops: ES-FoMo-II, 2024.](https://mlanthology.org/icmlw/2024/kong2024icmlw-tinyagent/)
BibTeX
@inproceedings{kong2024icmlw-tinyagent,
title = {{TinyAgent: Quantization-Aware Model Compression and Adaptation for On-Device LLM Agent Deployment}},
author = {Kong, Jason and Hu, Lanxiang and Ponzina, Flavio and Rosing, Tajana},
booktitle = {ICML 2024 Workshops: ES-FoMo-II},
year = {2024},
url = {https://mlanthology.org/icmlw/2024/kong2024icmlw-tinyagent/}
}