Mode-Conditioning Unlocks Superior Test-Time Compute Scaling
Abstract
Parallel sampling is essential to test-time scaling and reinforcement learning (RL), but its effectiveness is sharply limited by diversity collapse, where models concentrate on a few modes and repeated samples produce the same mistakes. We propose the mode-conditioning (ModC) framework, which explicitly allocates sampling compute across reasoning modes using either specialist models or mode-specific prefixes. With predefined mode labels, ModC consistently improves test-time scaling (Pass@k) across controlled graph-search tasks and math reasoning benchmarks, spanning model families and sizes from 0.5B to 7B. On OpenThoughts, fine-tuning Qwen2.5-7B with ModC achieves a 4× efficiency gain over standard training while also improving the maximum attainable Pass@k. We further show that gradient clustering enables ModC without predefined mode labels, yielding up to 10% gains on datasets such as NuminaMath. Finally, we show that ModC improves Pass@k after RL training and can further boost the Pass@k gains of diversity-inducing RL methods. These results demonstrate that standard training underutilizes the diversity in data, and that ModC provides a simple, effective remedy for unlocking the full benefits of diversity in parallel sampling.
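As a rough illustration of why allocating samples across modes can help Pass@k, the following minimal Python sketch compares a collapsed sampler with a mode-conditioned one on a toy problem. It is not the authors' implementation: the even-split policy in `allocate_budget`, the mode names `bfs` and `dfs`, the per-mode success probabilities, and the assumption that samples are independent are all hypothetical choices made for the example.

```python
from typing import Dict, List


def allocate_budget(k: int, modes: List[str]) -> Dict[str, int]:
    """Split k samples across modes as evenly as possible.

    Assumption: an even split; the abstract does not specify the policy.
    """
    if k < 0 or not modes:
        raise ValueError("k must be non-negative and modes non-empty")
    base, extra = divmod(k, len(modes))
    # The first `extra` modes receive one additional sample each.
    return {m: base + (1 if i < extra else 0) for i, m in enumerate(modes)}


def pass_at_k(allocation: Dict[str, int], success_prob: Dict[str, float]) -> float:
    """Probability that at least one sample succeeds.

    Assumption: samples are independent, and each sample drawn in a mode
    succeeds with that mode's probability.
    """
    fail = 1.0
    for mode, n in allocation.items():
        fail *= (1.0 - success_prob[mode]) ** n
    return 1.0 - fail


# Toy problem (hypothetical numbers): only the "dfs" mode can solve it.
success_prob = {"bfs": 0.0, "dfs": 0.5}

# Diversity collapse: all 4 samples land in the same (wrong) mode.
collapsed = {"bfs": 4, "dfs": 0}

# Mode-conditioned sampling: the same budget is spread across both modes.
conditioned = allocate_budget(4, ["bfs", "dfs"])

collapsed_score = pass_at_k(collapsed, success_prob)
conditioned_score = pass_at_k(conditioned, success_prob)
```

In this toy setting the collapsed sampler never succeeds because every sample repeats the same failing mode, while the mode-conditioned sampler spends half of its budget on the mode that can solve the problem.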
Cite
Text
Wu et al. "Mode-Conditioning Unlocks Superior Test-Time Compute Scaling." International Conference on Learning Representations, 2026.
Markdown
[Wu et al. "Mode-Conditioning Unlocks Superior Test-Time Compute Scaling." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/wu2026iclr-modeconditioning/)
BibTeX
@inproceedings{wu2026iclr-modeconditioning,
title = {{Mode-Conditioning Unlocks Superior Test-Time Compute Scaling}},
author = {Wu, Chen Henry and Goyal, Sachin and Raghunathan, Aditi},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/wu2026iclr-modeconditioning/}
}