Reasoning Models Can Be Accurately Pruned via Chain-of-Thought Reconstruction

Abstract

Reasoning language models such as DeepSeek-R1 produce long chain-of-thought traces at inference time, which makes them costly to deploy at scale. We show that compressing these models with techniques such as neural network pruning causes greater performance loss than in typical language modeling tasks, and in some cases can even make the model slower, because the pruned model produces more thinking tokens while performing worse. We show that this is partly because standard LLM pruning methods often focus on input reconstruction, whereas reasoning is a decode-dominated task. We introduce a simple, drop-in fix: during pruning, we jointly reconstruct activations from the input and the model’s on-policy chain-of-thought traces. This “Reasoning-Aware Compression” (RAC) integrates seamlessly into existing pruning workflows such as SparseGPT and boosts their performance significantly. Anonymized code can be found at: https://github.com/RyanLucas3/Reasoning-Aware-Compression
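The joint reconstruction idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the random arrays stand in for real layer activations, the function names (`joint_calibration`, `reconstruction_error`, `magnitude_prune`) are hypothetical, and plain magnitude pruning is used as a stand-in for SparseGPT. The sketch only shows how a layer-wise reconstruction objective changes when activations from chain-of-thought (decode) tokens are stacked with activations from the prompt.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 8, 4

# Stand-ins for one linear layer's input activations (assumed shapes: tokens x d_in).
prompt_acts = rng.normal(size=(16, d_in))  # activations on prompt tokens
cot_acts = rng.normal(size=(64, d_in))     # activations on self-generated CoT tokens
W = rng.normal(size=(d_out, d_in))         # dense layer weights


def joint_calibration(prompt_acts, cot_acts):
    """Stack prompt and CoT activations; return X and the Gram matrix H = X^T X."""
    X = np.concatenate([prompt_acts, cot_acts], axis=0)
    return X, X.T @ X


def reconstruction_error(W, W_pruned, X):
    """Layer-wise squared error ||X W^T - X W_pruned^T||_F^2."""
    diff = X @ (W - W_pruned).T
    return float(np.sum(diff ** 2))


def magnitude_prune(W, sparsity):
    """Zero the smallest-magnitude weights (a simple stand-in, not SparseGPT)."""
    k = int(W.size * sparsity)
    if k == 0:
        return W.copy()
    threshold = np.sort(np.abs(W), axis=None)[k - 1]
    return np.where(np.abs(W) <= threshold, 0.0, W)


X, H = joint_calibration(prompt_acts, cot_acts)
W_pruned = magnitude_prune(W, 0.5)

err_prompt = reconstruction_error(W, W_pruned, prompt_acts)
err_cot = reconstruction_error(W, W_pruned, cot_acts)
err_joint = reconstruction_error(W, W_pruned, X)

# The joint objective is the sum of the prompt-only and CoT-only terms,
# so decode-time activations contribute directly to what the pruner minimizes.
print(err_prompt, err_cot, err_joint)
```

In this sketch, a Hessian-based pruner would consume `H` built from both token sources instead of from the prompt alone; how the traces are collected and weighted in the actual method is described in the paper and repository, not here.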

Cite

Text

Lucas et al. "Reasoning Models Can Be Accurately Pruned via Chain-of-Thought Reconstruction." International Conference on Learning Representations, 2026.

Markdown

[Lucas et al. "Reasoning Models Can Be Accurately Pruned via Chain-of-Thought Reconstruction." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/lucas2026iclr-reasoning/)

BibTeX

@inproceedings{lucas2026iclr-reasoning,
  title     = {{Reasoning Models Can Be Accurately Pruned via Chain-of-Thought Reconstruction}},
  author    = {Lucas, Ryan and Behdin, Kayhan and Wang, Zhipeng and Song, Qingquan and Tang, Shao and Mazumder, Rahul},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/lucas2026iclr-reasoning/}
}