SpareTrain: Fault-Tolerant LLM Training via Low-Cost Dual Modular Redundancy

Park, Rihae; Kim, Yeonjae; Lee, Seung Yul; Park, Yeonhong; Lee, Jae W.

ICLR 2026

Abstract

Dual Modular Redundancy (DMR) is a highly effective mechanism for detecting silent data corruption (SDC), a critical reliability concern in large language model (LLM) training, by executing each operation twice. However, its high computation overhead has prevented practical deployment at scale. In this paper, we present SpareTrain, an LLM training system that achieves complete DMR with minimal overhead by repurposing the activation checkpointing mechanism and exploiting idle GPU time. Evaluations on up to 32 H200 GPUs show that SpareTrain improves throughput by 12–35% over naive DMR, corresponding to only 3–14% overhead compared to unprotected training, while maintaining full DMR error detection capabilities.
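
As a rough illustration of the DMR principle the abstract describes (run each operation twice and compare the results), the following is a minimal Python sketch. It is not SpareTrain's implementation: the toy operation, the injected fault, and every name here (`scale_and_shift`, `corrupted_scale_and_shift`, `dmr_run`) are hypothetical, and the sketch assumes a deterministic operation so that an exact comparison is meaningful.

```python
from typing import Callable, List, Optional, Tuple

Vector = List[float]


def scale_and_shift(x: Vector) -> Vector:
    """Toy deterministic operation standing in for a training kernel."""
    return [2.0 * v + 1.0 for v in x]


def corrupted_scale_and_shift(x: Vector) -> Vector:
    """Same operation with a hypothetical silent error injected into element 0."""
    out = scale_and_shift(x)
    out[0] += 1e-3  # assumed fault model: a small, silent numerical deviation
    return out


def dmr_run(
    op: Callable[[Vector], Vector],
    x: Vector,
    redundant_op: Optional[Callable[[Vector], Vector]] = None,
) -> Tuple[Vector, bool]:
    """Execute the operation twice and report whether the two results agree.

    `redundant_op` exists only so a fault can be injected into the second
    execution for demonstration; by default the same `op` runs both times.
    """
    primary = op(x)
    redundant = (redundant_op or op)(x)
    return primary, primary == redundant


x = [0.5, 1.0, 1.5]
_, clean_ok = dmr_run(scale_and_shift, x)
_, faulty_ok = dmr_run(scale_and_shift, x, corrupted_scale_and_shift)
print(clean_ok, faulty_ok)  # prints: True False
```

The second execution is what makes naive DMR expensive: every protected operation is computed twice, which is the overhead the abstract says SpareTrain reduces.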

Cite

Text

Park et al. "SpareTrain: Fault-Tolerant LLM Training via Low-Cost Dual Modular Redundancy." International Conference on Learning Representations, 2026.

Markdown

[Park et al. "SpareTrain: Fault-Tolerant LLM Training via Low-Cost Dual Modular Redundancy." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/park2026iclr-sparetrain/)

BibTeX

@inproceedings{park2026iclr-sparetrain,
  title     = {{SpareTrain: Fault-Tolerant LLM Training via Low-Cost Dual Modular Redundancy}},
  author    = {Park, Rihae and Kim, Yeonjae and Lee, Seung Yul and Park, Yeonhong and Lee, Jae W.},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/park2026iclr-sparetrain/}
}