Multi-Turn Code Generation Through Single-Step Rewards

Abstract

We address the problem of code generation from multi-turn execution feedback. Existing methods either generate code without feedback or use complex, hierarchical reinforcement learning to optimize multi-turn rewards. We propose a simple yet scalable approach, $\mu$CODE, that solves multi-turn code generation using only single-step rewards. Our key insight is that code generation is a one-step recoverable MDP, where the correct code can be recovered from any intermediate code state in a single turn. $\mu$CODE iteratively trains both a generator to provide code solutions conditioned on multi-turn execution feedback and a verifier to score the newly generated code. Experimental evaluations show that our approach achieves significant improvements over state-of-the-art baselines. We provide an analysis of the design choices for the reward models and policy, and show the efficacy of $\mu$CODE in utilizing execution feedback.
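The abstract's inference loop can be sketched as follows: at each turn a generator proposes candidate programs conditioned on prior execution feedback, a learned verifier scores them (single-step, best-of-n selection), and the top candidate is executed to produce feedback for the next turn. This is a minimal illustrative sketch, not the paper's implementation; the function names and the toy verifier are hypothetical.

```python
# Hypothetical sketch of a muCODE-style inference loop: generator proposes
# candidates, a verifier picks one per turn (single-step reward), execution
# feedback conditions the next turn. Names here are illustrative only.

def run_tests(code_fn, tests):
    """Execute a candidate against unit tests; return (passed, feedback)."""
    for inp, expected in tests:
        got = code_fn(inp)
        if got != expected:
            return False, f"input={inp!r}: expected {expected!r}, got {got!r}"
    return True, "all tests passed"

def best_of_n(candidates, verifier):
    """Single-step selection: keep the candidate the verifier scores highest."""
    return max(candidates, key=verifier)

def multi_turn_generate(generator, verifier, tests, max_turns=3):
    """Iterate generate -> verify -> execute until tests pass or turns run out."""
    feedback = None
    best = None
    for _ in range(max_turns):
        candidates = generator(feedback)   # propose programs given feedback
        best = best_of_n(candidates, verifier)
        passed, feedback = run_tests(best, tests)
        if passed:
            break
    return best, feedback
```

As a toy usage, a generator whose first-turn guesses fail can "recover" the correct program in one turn once it sees the execution feedback, mirroring the one-step recoverability intuition from the abstract; a real verifier would be a learned scoring model rather than a hand-written heuristic.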

Cite

Text

Jain et al. "Multi-Turn Code Generation Through Single-Step Rewards." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Jain et al. "Multi-Turn Code Generation Through Single-Step Rewards." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/jain2025icml-multiturn/)

BibTeX

@inproceedings{jain2025icml-multiturn,
  title     = {{Multi-Turn Code Generation Through Single-Step Rewards}},
  author    = {Jain, Arnav Kumar and Gonzalez-Pumariega, Gonzalo and Chen, Wayne and Rush, Alexander M and Zhao, Wenting and Choudhury, Sanjiban},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {26700--26716},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/jain2025icml-multiturn/}
}