Cost-Efficient Collaboration Between On-Device and Cloud Language Models
Abstract
We investigate an emerging setup in which a small, on-device language model (LM) with access to local data collaborates with a frontier, cloud-hosted LM to solve real-world tasks involving financial, medical, and scientific reasoning over long documents. *Can a local-remote collaboration reduce cloud inference costs while preserving quality?* First, we consider a naïve communication protocol where the local and remote models simply chat back and forth. Because only the local model reads the full context, this protocol achieves a 30.4× reduction in remote costs, but fails to recover the performance of the frontier model. We identify two key limitations of this protocol: the local model struggles to (1) follow the remote model's multi-step instructions and (2) reason over long contexts. Motivated by these observations, we study an extension of this protocol, coined Minions, in which the remote model decomposes the task into easier subtasks over shorter chunks of the document, which are executed locally in parallel. Minions reduces costs by 5.7× on average while recovering 97.9% of the performance of the remote model alone. Our analysis reveals several key design choices that influence the trade-off between cost and performance in local-remote systems.
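The decompose, execute-in-parallel, and aggregate pattern described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`chunk_document`, `run_round`), the character-based chunking, and the toy stand-ins for the local and remote models are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def chunk_document(document: str, chunk_size: int) -> List[str]:
    """Split the document into consecutive chunks of at most chunk_size characters."""
    return [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]


def run_round(
    document: str,
    subtasks: List[str],
    local_model: Callable[[str, str], str],
    aggregate: Callable[[List[str]], str],
    chunk_size: int = 1000,
) -> str:
    """Run every (subtask, chunk) pair locally in parallel, then aggregate."""
    chunks = chunk_document(document, chunk_size)
    jobs = [(task, chunk) for task in subtasks for chunk in chunks]
    with ThreadPoolExecutor() as pool:
        # pool.map preserves the order of jobs in its results.
        outputs = list(pool.map(lambda job: local_model(*job), jobs))
    # Only the short local outputs, never the full document, are passed on
    # to the (hypothetical) remote aggregation step.
    return aggregate(outputs)


# Toy stand-ins for the two models (assumptions for illustration only).
def toy_local_model(task: str, chunk: str) -> str:
    """Return the chunk if it mentions the subtask keyword, else nothing."""
    return chunk if task in chunk else ""


def toy_aggregate(outputs: List[str]) -> str:
    """Join the non-empty local outputs into one answer string."""
    return " | ".join(o for o in outputs if o)
```

In a real system the subtasks would be produced by the remote model and `aggregate` would be a remote call; here both are supplied by the caller so that the sketch runs without any model.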
Cite
Text
Narayan et al. "Cost-Efficient Collaboration Between On-Device and Cloud Language Models." ICLR 2025 Workshops: FM-Wild, 2025.
Markdown
[Narayan et al. "Cost-Efficient Collaboration Between On-Device and Cloud Language Models." ICLR 2025 Workshops: FM-Wild, 2025.](https://mlanthology.org/iclrw/2025/narayan2025iclrw-costefficient/)
BibTeX
@inproceedings{narayan2025iclrw-costefficient,
title = {{Cost-Efficient Collaboration Between On-Device and Cloud Language Models}},
author = {Narayan, Avanika and Eyuboglu, Sabri and Biderman, Dan and May, Avner and Linderman, Scott and Zou, James and Re, Christopher},
booktitle = {ICLR 2025 Workshops: FM-Wild},
year = {2025},
url = {https://mlanthology.org/iclrw/2025/narayan2025iclrw-costefficient/}
}