SpareTrain: Fault-Tolerant LLM Training via Low-Cost Dual Modular Redundancy
Abstract
Dual Modular Redundancy (DMR) is a highly effective mechanism for detecting silent data corruption (SDC), a critical reliability concern in large language model (LLM) training, by executing each operation twice. However, its high computation overhead has prevented practical deployment at scale. In this paper, we present SpareTrain, an LLM training system that achieves complete DMR with minimal overhead by repurposing the activation checkpointing mechanism and exploiting idle GPU time. Evaluations on up to 32 H200 GPUs show that SpareTrain improves throughput by 12–35% over naive DMR, corresponding to only 3–14% overhead compared to unprotected training, while maintaining full DMR error detection capabilities.
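The detection principle behind DMR, executing each operation twice and comparing the results, can be illustrated with a minimal sketch. This is not SpareTrain's implementation and does not model its reuse of activation checkpointing or idle GPU time. All names here (`dmr_run`, `FaultyOnce`, `flip_bit`, `SilentDataCorruption`) are hypothetical, and the sketch assumes the operation is bitwise deterministic so that any mismatch indicates corruption.

```python
import struct
from typing import Callable, Sequence


class SilentDataCorruption(RuntimeError):
    """Raised when two redundant executions disagree (hypothetical name)."""


def flip_bit(x: float, bit: int) -> float:
    """Return x with one bit of its IEEE-754 float64 encoding flipped."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", x))
    (out,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return out


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """A stand-in for one training operation."""
    return sum(x * y for x, y in zip(a, b))


def dmr_run(op: Callable[..., float], *args) -> float:
    """Naive DMR: run op twice and compare the two results exactly.

    Assumes op is deterministic, so a mismatch can only come from a fault.
    """
    first = op(*args)
    second = op(*args)
    if first != second:
        raise SilentDataCorruption(f"mismatch: {first!r} vs {second!r}")
    return first


class FaultyOnce:
    """Wraps op and corrupts the result of one chosen call (fault injection)."""

    def __init__(self, op: Callable[..., float], fail_on_call: int, bit: int = 52):
        self.op = op
        self.fail_on_call = fail_on_call
        self.bit = bit
        self.calls = 0

    def __call__(self, *args) -> float:
        self.calls += 1
        result = self.op(*args)
        if self.calls == self.fail_on_call:
            return flip_bit(result, self.bit)  # simulate a silent bit flip
        return result


if __name__ == "__main__":
    a, b = [1.0, 2.0], [3.0, 4.0]
    print(dmr_run(dot, a, b))  # fault-free: both executions agree
    try:
        dmr_run(FaultyOnce(dot, fail_on_call=2), a, b)
    except SilentDataCorruption as err:
        print("detected:", err)
```

Because every operation runs twice, this naive form roughly doubles the compute cost, which is the overhead the abstract describes as the barrier to deployment at scale.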
Cite
Text
Park et al. "SpareTrain: Fault-Tolerant LLM Training via Low-Cost Dual Modular Redundancy." International Conference on Learning Representations, 2026.
Markdown
[Park et al. "SpareTrain: Fault-Tolerant LLM Training via Low-Cost Dual Modular Redundancy." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/park2026iclr-sparetrain/)
BibTeX
@inproceedings{park2026iclr-sparetrain,
title = {{SpareTrain: Fault-Tolerant LLM Training via Low-Cost Dual Modular Redundancy}},
author = {Park, Rihae and Kim, Yeonjae and Lee, Seung Yul and Park, Yeonhong and Lee, Jae W.},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/park2026iclr-sparetrain/}
}