TinyAgent: Quantization-Aware Model Compression and Adaptation for On-Device LLM Agent Deployment
Abstract
Deploying LLMs on edge devices is challenging due to stringent memory and compute constraints. Existing deployment solutions for LLM agents in edge applications disaggregate the fine-tuning process for domain-specific adaptation from the post-training model compression process. As a result, extensive experimentation is required to find a readily available model compression technique that minimizes a fine-tuned model's performance loss while satisfying the target hardware's memory constraints. To address this problem, we propose TinyAgent, which streamlines the deployment workflow by applying a quantization-aware model compression technique to specialized decision-making LLM agents in resource-constrained environments. Our approach accounts for both deployment-time hardware constraints and the challenges of post-training quantization during fine-tuning. Experimental results demonstrate that our approach not only reduces memory usage by $8\times$, making LLM inference possible across a variety of edge devices, but also consistently speeds up LLM inference by up to $4.5\times$ without compromising accuracy.
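The abstract does not spell out implementation details, so as a rough illustration of what "quantization-aware" fine-tuning generally involves, the sketch below fake-quantizes the weights of a linear layer in the forward pass while letting gradients flow to the full-precision weights via the straight-through estimator. The class name, 4-bit width, and symmetric per-tensor scaling are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FakeQuantLinear(nn.Linear):
    """Linear layer whose weights are quantize-dequantized (int4 by default)
    in the forward pass, so fine-tuning adapts to the quantization error.
    Hypothetical sketch, not the TinyAgent implementation."""

    def __init__(self, in_features, out_features, bias=True, n_bits=4):
        super().__init__(in_features, out_features, bias)
        self.n_bits = n_bits

    def forward(self, x):
        qmax = 2 ** (self.n_bits - 1) - 1           # e.g. 7 for int4
        scale = self.weight.abs().max().clamp(min=1e-8) / qmax
        # Round to the low-bit grid, then map back to float.
        w_q = torch.round(self.weight / scale).clamp(-qmax - 1, qmax) * scale
        # Straight-through estimator: gradients bypass the rounding.
        w = self.weight + (w_q - self.weight).detach()
        return F.linear(x, w, self.bias)
```

Swapping a pretrained model's linear layers for a module like this before fine-tuning lets the optimizer compensate for rounding error during adaptation, so the weights can later be stored in true low-bit form at deployment time.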
Cite
Text
Kong et al. "TinyAgent: Quantization-Aware Model Compression and Adaptation for On-Device LLM Agent Deployment." ICML 2024 Workshops: ES-FoMo-II, 2024.
Markdown
[Kong et al. "TinyAgent: Quantization-Aware Model Compression and Adaptation for On-Device LLM Agent Deployment." ICML 2024 Workshops: ES-FoMo-II, 2024.](https://mlanthology.org/icmlw/2024/kong2024icmlw-tinyagent/)
BibTeX
@inproceedings{kong2024icmlw-tinyagent,
title = {{TinyAgent: Quantization-Aware Model Compression and Adaptation for On-Device LLM Agent Deployment}},
author = {Kong, Jason and Hu, Lanxiang and Ponzina, Flavio and Rosing, Tajana},
booktitle = {ICML 2024 Workshops: ES-FoMo-II},
year = {2024},
url = {https://mlanthology.org/icmlw/2024/kong2024icmlw-tinyagent/}
}