D3: A Large Dataset for Training Code Language Models to Act Diff-by-Diff

Abstract

We introduce D3 ("Diverse Data for Diff-by-Diff Coding"), a large dataset for training LMs to iteratively synthesize general-purpose Python source code by generating file diffs. D3 frames code synthesis as a goal-conditioned sequential decision-making problem in which goals, states, and actions are represented by token sequences corresponding to the description of a functionality to add, the current contents of a file, and a file diff, respectively. The dataset contains 8 billion tokens of instruction, file-state, and file-diff-sequence examples sampled from 850,000 human-written Python source files. To construct D3, we filter, augment, and annotate source code from The Stack by sampling synthetic file-diff sequences with a code analysis tool and labeling each sample with an LLM-generated rationale. In our experiments, we show that mid-training LMs such as Llama 3.2 1B and 3B on D3 before supervised fine-tuning (SFT) on task-curated data improves performance on synthesis and editing tasks. On benchmarks like HumanEvalSynth and HumanEvalFix, we observe pass@1 improvements of 3 to 6 points over direct SFT. D3-trained models are particularly strong at completing partial human-written solutions to programming problems.
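To make the goal/state/action framing concrete, the sketch below serializes a single editing step as one flat text sequence: an instruction (the goal), the current file contents (the state), and a unified diff (the action). This is a minimal illustration only; the <goal>/<state>/<diff> delimiters, the file names, and the helper make_example are our assumptions, since the abstract does not specify D3's exact serialization format.

import difflib

def make_example(instruction: str, before: str, after: str) -> str:
    """Serialize one goal-conditioned editing step as a flat text sequence.

    NOTE: the delimiters below are illustrative placeholders, not D3's
    actual training format.
    """
    # Compute a unified file diff taking the current state to the edited state.
    diff = "".join(
        difflib.unified_diff(
            before.splitlines(keepends=True),
            after.splitlines(keepends=True),
            fromfile="a/module.py",  # hypothetical file path
            tofile="b/module.py",
        )
    )
    return (
        f"<goal>{instruction}</goal>\n"
        f"<state>{before}</state>\n"
        f"<diff>{diff}</diff>"
    )

if __name__ == "__main__":
    before = "def add(a, b):\n    return a + b\n"
    after = before + "\ndef sub(a, b):\n    return a - b\n"
    print(make_example("Add a subtraction helper.", before, after))

Running the script prints one example whose target action is the diff that adds a sub helper to a file already containing add; a model trained on such sequences learns to emit the diff conditioned on the goal and the current file state.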

Cite

Text

Piterbarg et al. "D3: A Large Dataset for Training Code Language Models to Act Diff-by-Diff." ICLR 2025 Workshops: Data_Problems, 2025.

Markdown

[Piterbarg et al. "D3: A Large Dataset for Training Code Language Models to Act Diff-by-Diff." ICLR 2025 Workshops: Data_Problems, 2025.](https://mlanthology.org/iclrw/2025/piterbarg2025iclrw-d3/)

BibTeX

@inproceedings{piterbarg2025iclrw-d3,
  title     = {{D3: A Large Dataset for Training Code Language Models to Act Diff-by-Diff}},
  author    = {Piterbarg, Ulyana and Gandhi, Kanishk and Pinto, Lerrel and Goodman, Noah and Fergus, Rob},
  booktitle = {ICLR 2025 Workshops: Data_Problems},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/piterbarg2025iclrw-d3/}
}