Prima.cpp: Fast 30-70b LLM Inference on Heterogeneous and Low-Resource Home Clusters
Abstract
On-device inference offers privacy, offline use, and instant response, but consumer hardware restricts large language models (LLMs) to low throughput and capability. To overcome this challenge, we present prima.cpp, a distributed on-device inference system that runs 30-70B LLMs on consumer home clusters with mixed CPUs/GPUs, insufficient RAM/VRAM, slow disks, Wi-Fi links, and heterogeneous OSs. We introduce pipelined-ring parallelism (PRP) to overlap disk I/O with compute and communication, and address the prefetch-release conflict in mmap-based offloading. We further propose Halda, a heterogeneity-aware scheduler that co-optimizes per-device CPU/GPU workloads and device selection under RAM/VRAM constraints. On four consumer home devices, a 70B model reaches 674 ms/token TPOT with <6% memory pressure, and a 32B model with speculative decoding achieves 26 tokens/s. Compared with llama.cpp, exo, and dllama, prima.cpp achieves 5-17× lower TPOT, supports fine-grained model sizes from 8B to 70B, ensures broader cross-OS and quantization compatibility, and remains OOM-free, while also being Wi-Fi tolerant, privacy-preserving, and hardware-independent. The code is available at https://gitee.com/zonghang-li/prima.cpp.
Cite
Text
Li et al. "Prima.cpp: Fast 30-70b LLM Inference on Heterogeneous and Low-Resource Home Clusters." International Conference on Learning Representations, 2026.
Markdown
[Li et al. "Prima.cpp: Fast 30-70b LLM Inference on Heterogeneous and Low-Resource Home Clusters." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/li2026iclr-prima/)
BibTeX
@inproceedings{li2026iclr-prima,
  title = {{Prima.cpp: Fast 30-70b LLM Inference on Heterogeneous and Low-Resource Home Clusters}},
  author = {Li, Zonghang and Li, Tao and Feng, Wenjiao and Xiao, Rongxing and She, Jianshu and Huang, Hong and Guizani, Mohsen and Yu, Hongfang and Ho, Qirong and Xiang, Wei and Liu, Xue},
  booktitle = {International Conference on Learning Representations},
  year = {2026},
  url = {https://mlanthology.org/iclr/2026/li2026iclr-prima/}
}