UNEX-RL: Reinforcing Long-Term Rewards in Multi-Stage Recommender Systems with UNidirectional EXecution

Abstract

In recent years, there has been growing interest in using reinforcement learning (RL) to optimize long-term rewards in recommender systems. Because industrial recommender systems are typically built as multi-stage systems, single-agent RL methods struggle to optimize multiple stages simultaneously: different stages have different observation spaces and therefore cannot be modeled by a single agent. To address this issue, we propose a novel UNidirectional-EXecution-based multi-agent Reinforcement Learning (UNEX-RL) framework to reinforce long-term rewards in multi-stage recommender systems. We show that unidirectional execution is a key feature of multi-stage recommender systems and that it brings two new challenges to applying multi-agent reinforcement learning (MARL), namely observation dependency and the cascading effect. To tackle these challenges, we provide a cascading information chain (CIC) method that separates independent observations from action-dependent observations, and we use CIC to train UNEX-RL effectively. We also discuss practical variance reduction techniques for UNEX-RL. Finally, we demonstrate the effectiveness of UNEX-RL on both public datasets and an online recommender system with over 100 million users. Specifically, UNEX-RL achieves a 0.558% increase in users' usage time compared with single-agent RL algorithms in online A/B experiments, highlighting its effectiveness in industrial recommender systems.
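
As a rough illustration of the unidirectional execution and observation dependency described in the abstract, the toy Python sketch below runs two stages in a fixed order, each keeping the top-k candidates under its own action. Everything in it is an assumption made for illustration and is not taken from the paper: the item features, the linear scoring rule, the use of a weight vector as an action, and the names `stage_select` and `run_pipeline`. It only shows that the first stage's observation does not depend on any action, while the second stage's observation changes when the upstream action changes.

```python
from typing import List, Sequence, Tuple

# Hypothetical item representation: (item_id, feature_a, feature_b).
Item = Tuple[str, float, float]
Action = Tuple[float, float]  # hypothetical action: a weight per feature


def stage_select(candidates: Sequence[Item], action: Action, k: int) -> List[Item]:
    """One stage: score candidates with the stage's action and keep the top k."""
    w_a, w_b = action
    ranked = sorted(candidates, key=lambda it: w_a * it[1] + w_b * it[2], reverse=True)
    return ranked[:k]


def run_pipeline(
    items: Sequence[Item], actions: Sequence[Action], ks: Sequence[int]
) -> Tuple[List[List[str]], List[str]]:
    """Run stages strictly in order (no feedback to earlier stages).

    Returns the candidate ids each stage observed and the final output ids.
    """
    observed: List[List[str]] = []
    candidates = list(items)
    for action, k in zip(actions, ks):
        observed.append([it[0] for it in candidates])  # what this stage sees
        candidates = stage_select(candidates, action, k)  # passed downstream only
    return observed, [it[0] for it in candidates]


ITEMS: List[Item] = [
    ("a", 0.9, 0.1),
    ("b", 0.8, 0.3),
    ("c", 0.2, 0.9),
    ("d", 0.1, 0.8),
    ("e", 0.5, 0.5),
]

# Same second-stage action, different first-stage action.
obs_x, out_x = run_pipeline(ITEMS, actions=[(1.0, 0.0), (0.0, 1.0)], ks=[3, 1])
obs_y, out_y = run_pipeline(ITEMS, actions=[(0.0, 1.0), (0.0, 1.0)], ks=[3, 1])

print(obs_x[0] == obs_y[0])  # first-stage observation is action-independent
print(obs_x[1], obs_y[1])    # second-stage observation depends on the upstream action
print(out_x, out_y)          # the upstream choice cascades to the final output
```

In this toy setting the second stage uses the same action in both runs yet returns a different item, because the first stage handed it a different candidate set. That is the kind of cascading effect a per-stage agent would have to account for.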

Cite

Text

Zhang et al. "UNEX-RL: Reinforcing Long-Term Rewards in Multi-Stage Recommender Systems with UNidirectional EXecution." AAAI Conference on Artificial Intelligence, 2024. doi:10.1609/AAAI.V38I8.28783

Markdown

[Zhang et al. "UNEX-RL: Reinforcing Long-Term Rewards in Multi-Stage Recommender Systems with UNidirectional EXecution." AAAI Conference on Artificial Intelligence, 2024.](https://mlanthology.org/aaai/2024/zhang2024aaai-unex/) doi:10.1609/AAAI.V38I8.28783

BibTeX

@inproceedings{zhang2024aaai-unex,
  title     = {{UNEX-RL: Reinforcing Long-Term Rewards in Multi-Stage Recommender Systems with UNidirectional EXecution}},
  author    = {Zhang, Gengrui and Wang, Yao and Chen, Xiaoshuang and Qian, Hongyi and Zhan, Kaiqiao and Wang, Ben},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2024},
  pages     = {9305--9313},
  doi       = {10.1609/AAAI.V38I8.28783},
  url       = {https://mlanthology.org/aaai/2024/zhang2024aaai-unex/}
}