Allocating Mixed Goods with Customized Fairness and Indivisibility Ratio

Abstract

Deep learning-based recommendation systems are increasingly important in industry. To meet strict SLA requirements, serving frameworks must handle concurrent queries efficiently. However, current serving systems fail to serve concurrent queries efficiently because of two problems: (1) inefficient operator (op) scheduling caused by the query-wise op launching mechanism, and (2) heavy contention caused by the mutable nature of recommendation model inference. This paper presents RecOS, a system designed to optimize concurrent recommendation model inference on GPUs. RecOS schedules ops from different queries by monitoring GPU workloads and assigning each op to the most suitable stream. This approach reduces contention and improves inference efficiency by exploiting inter-op parallelism and op characteristics. To maintain correctness across multiple CUDA streams, RecOS introduces a unified asynchronous tensor management mechanism. Evaluations show that RecOS improves online service performance, reducing latency by up to 68%.

Cite

Text

Li et al. "Allocating Mixed Goods with Customized Fairness and Indivisibility Ratio." International Joint Conference on Artificial Intelligence, 2024. doi:10.24963/ijcai.2024/318

Markdown

[Li et al. "Allocating Mixed Goods with Customized Fairness and Indivisibility Ratio." International Joint Conference on Artificial Intelligence, 2024.](https://mlanthology.org/ijcai/2024/li2024ijcai-allocating/) doi:10.24963/ijcai.2024/318

BibTeX

@inproceedings{li2024ijcai-allocating,
  title     = {{Allocating Mixed Goods with Customized Fairness and Indivisibility Ratio}},
  author    = {Li, Bo and Li, Zihao and Liu, Shengxin and Wu, Zekai},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2024},
  pages     = {2868--2876},
  doi       = {10.24963/ijcai.2024/318},
  url       = {https://mlanthology.org/ijcai/2024/li2024ijcai-allocating/}
}