STRATUS: A Multi-Agent System for Autonomous Reliability Engineering of Modern Clouds
Abstract
In cloud-scale systems, failures are the norm. A distributed computing cluster exhibits hundreds of machine failures and thousands of disk failures; software bugs and misconfigurations are reported to be even more frequent. The demand for autonomous, AI-driven reliability engineering continues to grow, as existing human-in-the-loop practices can hardly keep up with the scale of modern clouds. This paper presents STRATUS, an LLM-based multi-agent system for realizing autonomous Site Reliability Engineering (SRE) of cloud services. STRATUS consists of multiple specialized agents (e.g., for failure detection, diagnosis, and mitigation), organized in a state machine to assist system-level safety reasoning and enforcement. We formalize a key safety specification of agentic SRE systems like STRATUS, termed Transactional No-Regression (TNR), which enables safe exploration and iteration. We show that TNR can effectively improve autonomous failure mitigation. STRATUS significantly outperforms state-of-the-art SRE agents in success rate on failure mitigation problems in AIOpsLab and ITBench (two SRE benchmark suites), by at least 1.5 times across various models. STRATUS shows a promising path toward practical deployment of agentic systems for cloud reliability.
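The abstract does not spell out how TNR is defined, so the following is only an illustrative sketch of one plausible reading: mitigation actions are applied as a transaction and kept only if the system ends up no worse than before. The names `Action`, `count_unhealthy`, and `run_transaction`, the dictionary-based cluster state, and the "number of unhealthy services" metric are all assumptions made for this example and are not taken from the paper.

```python
import copy
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical cluster state: service name -> status string.
State = Dict[str, str]


@dataclass
class Action:
    """A hypothetical mitigation step that mutates the state in place."""
    name: str
    apply: Callable[[State], None]


def count_unhealthy(state: State) -> int:
    """Assumed severity metric: number of services not marked 'ok'."""
    return sum(1 for status in state.values() if status != "ok")


def run_transaction(state: State, actions: List[Action]) -> bool:
    """Apply all actions; commit only if severity did not increase.

    Returns True on commit and False if the state was rolled back.
    """
    checkpoint = copy.deepcopy(state)   # snapshot for rollback
    before = count_unhealthy(state)
    try:
        for action in actions:
            action.apply(state)
        regressed = count_unhealthy(state) > before
    except Exception:
        regressed = True                # a failed action aborts the transaction
    if regressed:
        state.clear()
        state.update(checkpoint)        # restore the pre-transaction state
        return False
    return True


cluster = {"frontend": "ok", "cart": "crashloop"}
harmful = [Action("stop-frontend", lambda s: s.update(frontend="down"))]
helpful = [Action("restart-cart", lambda s: s.update(cart="ok"))]

committed_harmful = run_transaction(cluster, harmful)  # rolled back
committed_helpful = run_transaction(cluster, helpful)  # committed
```

Under these assumptions, a harmful attempt leaves the cluster exactly as it was, so an agent can retry with a different plan without compounding the original failure. A real system would need a far richer health signal and an undo mechanism for actions that cannot be reverted by restoring a snapshot.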
Cite
Text
Chen et al. "STRATUS: A Multi-Agent System for Autonomous Reliability Engineering of Modern Clouds." Advances in Neural Information Processing Systems, 2025.
Markdown
[Chen et al. "STRATUS: A Multi-Agent System for Autonomous Reliability Engineering of Modern Clouds." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/chen2025neurips-stratus/)
BibTeX
@inproceedings{chen2025neurips-stratus,
title = {{STRATUS: A Multi-Agent System for Autonomous Reliability Engineering of Modern Clouds}},
author = {Chen, Yinfang and Pan, Jiaqi and Clark, Jackson and Su, Yiming and Zheutlin, Noah and Bhavya, Bhavya and Arora, Rohan R. and Deng, Yu and Jha, Saurabh and Xu, Tianyin},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025},
url = {https://mlanthology.org/neurips/2025/chen2025neurips-stratus/}
}