Probing to Refine: Reinforcement Distillation of LLM Reasoners via Explanatory Inversion

Abstract

Distilling robust reasoning capabilities from large language models (LLMs) into smaller, computationally efficient student models remains an unresolved challenge. Despite recent advances, distilled models frequently suffer from superficial pattern memorization and subpar generalization. To overcome these limitations, we introduce a novel distillation framework that moves beyond simple mimicry to instill a deeper conceptual understanding. Our framework features two key innovations. First, to address pattern memorization, Explanatory Inversion (EI) generates targeted "explanatory probes" that compel the student to articulate the underlying logic behind an answer, rather than just memorizing it. Second, to improve generalization, Explanatory GRPO (EXGRPO) uses a reinforcement learning algorithm with a novel Dialogue Structure Utility Bonus, which explicitly rewards the student for maintaining a coherent reasoning process across these probes. Extensive evaluations on 12 datasets demonstrate significant improvements. Using Gemma-7b as the student model, our method yields an average 20.39% increase over zero-shot performance and a 6.02% improvement over the state-of-the-art distillation baselines. Moreover, models distilled with our method show remarkable training efficiency (e.g., surpassing vanilla fine-tuning with 10-25% training data) and strong generalization to out-of-distribution tasks.
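As a rough illustration of how a utility bonus could enter a GRPO-style update, the sketch below adds a per-response bonus to a task reward and then standardizes the result within a group of sampled responses, which is the usual group-relative advantage computation. The abstract does not specify how the Dialogue Structure Utility Bonus is computed or weighted, so the function name `group_relative_advantages`, the `bonuses` input, and the `bonus_weight` parameter are all assumptions made for illustration, not the authors' formulation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, bonuses, bonus_weight=0.1, eps=1e-8):
    """Standardize shaped rewards within one group of sampled responses.

    rewards      -- task reward per response (e.g. 1.0 if the answer is correct)
    bonuses      -- hypothetical per-response coherence bonus (assumption:
                    stands in for a dialogue-structure utility signal)
    bonus_weight -- hypothetical mixing coefficient, not taken from the paper
    """
    if len(rewards) != len(bonuses):
        raise ValueError("rewards and bonuses must have the same length")

    # Shaped reward: task reward plus the weighted bonus.
    shaped = [r + bonus_weight * b for r, b in zip(rewards, bonuses)]

    # Group-relative advantage: subtract the group mean, divide by the group
    # standard deviation (eps avoids division by zero for uniform groups).
    mu = mean(shaped)
    sigma = pstdev(shaped)
    return [(s - mu) / (sigma + eps) for s in shaped]


if __name__ == "__main__":
    # Four sampled responses to one prompt; only the first earns the bonus.
    advantages = group_relative_advantages(
        rewards=[1.0, 0.0, 1.0, 0.0],
        bonuses=[1.0, 0.0, 0.0, 0.0],
    )
    print(advantages)
```

Under these assumptions, two responses with the same task reward are separated only by the bonus, so the response that earned it receives the larger advantage within its group.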

Cite

Text

Tan et al. "Probing to Refine: Reinforcement Distillation of LLM Reasoners via Explanatory Inversion." International Conference on Learning Representations, 2026.

Markdown

[Tan et al. "Probing to Refine: Reinforcement Distillation of LLM Reasoners via Explanatory Inversion." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/tan2026iclr-probing/)

BibTeX

@inproceedings{tan2026iclr-probing,
  title     = {{Probing to Refine: Reinforcement Distillation of LLM Reasoners via Explanatory Inversion}},
  author    = {Tan, Zhen and Zhao, Chengshuai and Wang, Song and Li, Jundong and Chen, Tianlong and Liu, Huan},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/tan2026iclr-probing/}
}