CITER: Collaborative Inference for Efficient Large Language Model Decoding with Token-Level Routing

Zheng, Wenhao; Chen, Yixiao; Zhang, Weitong; Kundu, Souvik; Li, Yun; Liu, Zhengzhong; Xing, Eric P.; Wang, Hongyi; Yao, Huaxiu

NeurIPSW 2024

Abstract

Large language models (LLMs) have achieved remarkable success in natural language processing tasks but suffer from high computational costs during inference, limiting their deployment in latency-constrained applications. To address this issue, we propose a novel Collaborative Inference with Token-lEvel Routing (CITER) framework that introduces a token-level routing mechanism, enabling efficient collaboration between small and large language models (SLMs and LLMs). Specifically, CITER enables routing non-critical tokens to an SLM to reduce computational overhead, while critical tokens are processed by an LLM to maintain generation quality. We formulate the training of the router as a reinforcement learning task, where the router receives rewards based on both the quality of predictions and the inference cost of generation. To further accelerate the reward evaluation process, we introduce a shortcut for reward function estimation, significantly reducing the cost of the reward estimation. Extensive experiments demonstrate that CITER reduces inference cost while preserving high-quality generation, offering a promising solution for real-time and resource-constrained applications.
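
The routing idea in the abstract can be illustrated with a minimal sketch of a token-level routing decode loop. This is not the authors' implementation: the names `route_and_decode`, `DecodeResult`, `threshold`, `small_cost`, and `large_cost` are hypothetical, the two "models" are toy lookup functions, and the router is a hand-written rule standing in for the router that the paper trains with reinforcement learning. The per-token costs are arbitrary illustrative units.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Token = str
NextToken = Callable[[Sequence[Token]], Token]  # prefix -> next token
Router = Callable[[Sequence[Token]], float]     # prefix -> "critical" score in [0, 1]


@dataclass
class DecodeResult:
    tokens: List[Token]           # newly generated tokens (prompt excluded)
    routed_to_large: List[bool]   # one routing decision per generated token
    cost: float                   # accumulated cost in arbitrary units


def route_and_decode(
    prompt: Sequence[Token],
    small_model: NextToken,
    large_model: NextToken,
    router: Router,
    *,
    threshold: float = 0.5,       # assumption: fixed cutoff on the router score
    max_new_tokens: int = 16,
    eos: Token = "<eos>",
    small_cost: float = 1.0,      # assumption: illustrative per-token costs
    large_cost: float = 10.0,
) -> DecodeResult:
    tokens = list(prompt)
    decisions: List[bool] = []
    cost = 0.0
    for _ in range(max_new_tokens):
        # Route this single token: high score -> large model, else small model.
        use_large = router(tokens) >= threshold
        model = large_model if use_large else small_model
        next_token = model(tokens)
        cost += large_cost if use_large else small_cost
        decisions.append(use_large)
        tokens.append(next_token)
        if next_token == eos:
            break
    return DecodeResult(tokens[len(prompt):], decisions, cost)


# Toy stand-ins (hypothetical): the small model gets only the first token wrong.
PROMPT = ["capital", "of", "France", "?"]
TARGET = ["Paris", "is", "nice", "<eos>"]


def toy_large(prefix: Sequence[Token]) -> Token:
    return TARGET[len(prefix) - len(PROMPT)]


def toy_small(prefix: Sequence[Token]) -> Token:
    i = len(prefix) - len(PROMPT)
    return "London" if i == 0 else TARGET[i]


def toy_router(prefix: Sequence[Token]) -> float:
    # Hand-written rule: treat only the first generated token as critical.
    return 0.9 if len(prefix) == len(PROMPT) else 0.1


result = route_and_decode(PROMPT, toy_small, toy_large, toy_router)
print(result.tokens, result.routed_to_large, result.cost)
```

In this toy run only the first token is sent to the large model, so the output matches what the large model alone would produce at a cost of 13 units instead of 40. A learned router would replace `toy_router`, and the trade-off between quality and cost would be set by how that router is trained rather than by a fixed rule.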

Cite

Text

Zheng et al. "CITER: Collaborative Inference for Efficient Large Language Model Decoding with Token-Level Routing." NeurIPS 2024 Workshops: AFM, 2024.

Markdown

[Zheng et al. "CITER: Collaborative Inference for Efficient Large Language Model Decoding with Token-Level Routing." NeurIPS 2024 Workshops: AFM, 2024.](https://mlanthology.org/neuripsw/2024/zheng2024neuripsw-citer/)

BibTeX

@inproceedings{zheng2024neuripsw-citer,
  title     = {{CITER: Collaborative Inference for Efficient Large Language Model Decoding with Token-Level Routing}},
  author    = {Zheng, Wenhao and Chen, Yixiao and Zhang, Weitong and Kundu, Souvik and Li, Yun and Liu, Zhengzhong and Xing, Eric P. and Wang, Hongyi and Yao, Huaxiu},
  booktitle = {NeurIPS 2024 Workshops: AFM},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/zheng2024neuripsw-citer/}
}