Efficient Inter-Operator Scheduling for Concurrent Recommendation Model Inference on GPU

Abstract

Deep learning-based recommendation systems are increasingly important in industry. To meet strict SLA requirements, serving frameworks must handle concurrent queries efficiently. However, current serving systems fall short on concurrent queries due to two problems: (1) inefficient operator (op) scheduling caused by the query-wise op launching mechanism, and (2) heavy contention caused by the mutable nature of recommendation model inference. This paper presents RecOS, a system designed to optimize concurrent recommendation model inference on GPUs. RecOS efficiently schedules ops from different queries by monitoring GPU workloads and assigning each op to the most suitable stream, reducing contention and improving inference efficiency by exploiting inter-op parallelism and op characteristics. To maintain correctness across multiple CUDA streams, RecOS introduces a unified asynchronous tensor management mechanism. Evaluations demonstrate that RecOS improves online serving performance, reducing latency by up to 68%.
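The workload-aware stream assignment that the abstract describes can be illustrated with a minimal sketch. Note this is not the paper's algorithm: the function name `assign_ops`, the scalar per-op cost model, and the greedy least-loaded policy are all illustrative assumptions; RecOS's actual scheduler also considers op characteristics and cross-stream contention.

```python
import heapq

def assign_ops(op_costs, num_streams):
    """Greedily assign each op to the currently least-loaded stream.

    A toy illustration of workload-aware inter-op scheduling:
    per-stream load is tracked as a running sum of assigned op costs,
    and each incoming op goes to the stream with the minimum load.
    """
    # Min-heap of (accumulated cost, stream id); ties break on stream id.
    heap = [(0.0, s) for s in range(num_streams)]
    heapq.heapify(heap)
    assignment = []
    for cost in op_costs:
        load, stream = heapq.heappop(heap)
        assignment.append(stream)
        heapq.heappush(heap, (load + cost, stream))
    return assignment
```

For example, `assign_ops([5, 3, 2, 1], 2)` spreads the ops so that per-stream load stays balanced (6 vs. 5) rather than launching all ops from one query onto a single stream.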

Cite

Text

Guo et al. "Efficient Inter-Operator Scheduling for Concurrent Recommendation Model Inference on GPU." International Joint Conference on Artificial Intelligence, 2025. doi:10.24963/IJCAI.2025/318

Markdown

[Guo et al. "Efficient Inter-Operator Scheduling for Concurrent Recommendation Model Inference on GPU." International Joint Conference on Artificial Intelligence, 2025.](https://mlanthology.org/ijcai/2025/guo2025ijcai-efficient/) doi:10.24963/IJCAI.2025/318

BibTeX

@inproceedings{guo2025ijcai-efficient,
  title     = {{Efficient Inter-Operator Scheduling for Concurrent Recommendation Model Inference on GPU}},
  author    = {Guo, Shuxi and Xu, Zikang and Liu, Jiahao and Zhang, Jinyi and Qi, Qi and Sun, Haifeng and Huang, Jun and Liao, Jianxin and Wang, Jingyu},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {2856--2864},
  doi       = {10.24963/IJCAI.2025/318},
  url       = {https://mlanthology.org/ijcai/2025/guo2025ijcai-efficient/}
}