Scaling On-Device GPU Inference for Large Generative Models

Abstract

Driven by advancements in generative AI, large machine learning models have revolutionized domains such as image processing, audio synthesis, and speech recognition. While server-based deployments remain the locus of peak performance, the imperative for on-device inference, necessitated by privacy and efficiency considerations, persists. Recognizing GPUs as the on-device ML accelerator with the widest reach, we present ML Drift, an optimized framework that extends the capabilities of state-of-the-art GPU-accelerated inference engines. ML Drift enables on-device execution of generative AI workloads that contain 10 to 100x more parameters than existing on-device generative AI models. ML Drift addresses intricate engineering challenges associated with cross-GPU API development, and ensures broad compatibility across mobile and desktop/laptop platforms, thereby facilitating the deployment of significantly more complex models on resource-constrained devices. Our GPU-accelerated ML/AI inference engine achieves an order-of-magnitude performance improvement relative to existing open-source GPU inference engines.

Cite

Text

Tang et al. "Scaling On-Device GPU Inference for Large Generative Models." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.

Markdown

[Tang et al. "Scaling On-Device GPU Inference for Large Generative Models." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.](https://mlanthology.org/cvprw/2025/tang2025cvprw-scaling/)

BibTeX

@inproceedings{tang2025cvprw-scaling,
  title     = {{Scaling On-Device GPU Inference for Large Generative Models}},
  author    = {Tang, Jiuqiang and Sorokin, Raman and Ignasheva, Ekaterina and Jensen, Grant and Chen, Lin and Lee, Juhyun and Kulik, Andrei and Grundmann, Matthias},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2025},
  pages     = {6355--6364},
  url       = {https://mlanthology.org/cvprw/2025/tang2025cvprw-scaling/}
}