DP-LLM: Runtime Model Adaptation with Dynamic Layer-Wise Precision Assignment

Abstract

How can we effectively handle queries for on-device large language models (LLMs) under varying runtime constraints, such as latency and accuracy? Multi-scale quantization addresses this challenge by enabling memory-efficient runtime model adaptation of LLMs through overlaying multiple model variants quantized to different bitwidths. However, an important question remains open: how can models be properly configured to match a target precision or latency? While mixed-precision quantization offers a promising solution, we go a step further by leveraging the key observation that the sensitivity of each layer changes dynamically across decoding steps. Building on this insight, we introduce DP-LLM, a novel mechanism that dynamically assigns precision to each layer based on input values. Experimental results across multiple models and benchmarks demonstrate that DP-LLM achieves a superior performance-latency trade-off, outperforming prior approaches.
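
To make the idea concrete, below is a minimal, hypothetical sketch of input-dependent layer-wise precision assignment in the spirit the abstract describes. It is not the paper's algorithm: the sensitivity proxy (input activation norm), the relative latency cost table, and the function `select_bitwidths` are all illustrative assumptions.

```python
import torch

# Hypothetical sketch of dynamic layer-wise precision assignment.
# The sensitivity proxy, cost table, and names are illustrative
# assumptions, not DP-LLM's actual algorithm.

AVAILABLE_BITS = (2, 4, 8)                # bitwidths of the overlaid variants
LATENCY_COST = {2: 1.0, 4: 2.0, 8: 4.0}   # assumed relative cost per layer


def select_bitwidths(layer_inputs, latency_budget):
    """Pick a bitwidth per layer for the current decoding step.

    layer_inputs: list of per-layer input activations (one tensor each).
    latency_budget: total relative latency allowed for this step.
    """
    # Stand-in sensitivity signal: layers whose current inputs have a
    # large norm are assumed to be more error-sensitive at this step.
    sensitivity = [x.float().norm().item() for x in layer_inputs]
    order = sorted(range(len(sensitivity)), key=lambda i: -sensitivity[i])

    # Start every layer at the lowest precision, then greedily upgrade
    # the most sensitive layers while the budget allows it.
    lowest = min(AVAILABLE_BITS)
    bits = [lowest] * len(layer_inputs)
    spent = LATENCY_COST[lowest] * len(layer_inputs)

    for idx in order:
        for b in sorted(AVAILABLE_BITS):
            if b <= bits[idx]:
                continue
            extra = LATENCY_COST[b] - LATENCY_COST[bits[idx]]
            if spent + extra > latency_budget:
                break
            bits[idx], spent = b, spent + extra
    return bits


# Example: four layers with differing input magnitudes; the budget lets
# roughly the two most sensitive layers run at higher precision.
inputs = [torch.randn(1, 16) * s for s in (0.5, 3.0, 1.0, 2.0)]
print(select_bitwidths(inputs, latency_budget=10.0))
```

In a real multi-scale deployment, the chosen bitwidth would presumably route each layer's computation to the corresponding quantized weights of the overlaid model variants; the sketch only shows how an input-dependent signal could drive that per-step choice under a latency budget.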

Cite

Text

Kwon et al. "DP-LLM: Runtime Model Adaptation with Dynamic Layer-Wise Precision Assignment." Advances in Neural Information Processing Systems, 2025.

Markdown

[Kwon et al. "DP-LLM: Runtime Model Adaptation with Dynamic Layer-Wise Precision Assignment." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/kwon2025neurips-dpllm/)

BibTeX

@inproceedings{kwon2025neurips-dpllm,
  title     = {{DP-LLM: Runtime Model Adaptation with Dynamic Layer-Wise Precision Assignment}},
  author    = {Kwon, Sangwoo and Seo, Seong Hoon and Lee, Jae W. and Park, Yeonhong},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/kwon2025neurips-dpllm/}
}