Leveraging Automated Unit Tests for Unsupervised Code Translation

Abstract

With little to no parallel data available for programming languages, unsupervised methods are well-suited to source code translation. However, the majority of unsupervised machine translation approaches rely on back-translation, a method developed in the context of natural language translation and one that inherently involves training on noisy inputs. Unfortunately, source code is highly sensitive to small changes; a single token can result in compilation failures or erroneous programs, unlike natural languages where small inaccuracies may not change the meaning of a sentence. To address this issue, we propose to leverage an automated unit-testing system to filter out invalid translations, thereby creating a fully tested parallel corpus. We found that fine-tuning an unsupervised model with this filtered data set significantly reduces the noise in the translations so-generated, comfortably outperforming the state-of-the-art for all language pairs studied. In particular, for Java→Python and Python→C++ we outperform the best previous methods by more than 16% and 24% respectively, reducing the error rate by more than 35%.
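
The abstract describes a filtering pipeline: sample candidate translations from an unsupervised model, run automatically generated unit tests against them, and keep only the candidates that pass, yielding a fully tested parallel corpus for fine-tuning. The sketch below is a minimal illustration of that idea, not the authors' implementation; `generate_unit_tests`, `model.translate`, and `run_tests` are hypothetical helpers standing in for the paper's automated test generation, translation model, and test execution.

```python
# Minimal sketch (assumed helpers, not the paper's code) of test-based filtering:
# keep only candidate translations that pass the automatically generated unit
# tests, and collect the surviving (source, translation) pairs as a corpus.

def build_tested_parallel_corpus(source_functions, model, n_candidates=10):
    parallel_corpus = []
    for src in source_functions:
        # Derive unit tests from the source function (the paper relies on an
        # automated unit-test generator for this step).
        tests = generate_unit_tests(src)

        # Sample several candidate translations from the unsupervised model.
        candidates = model.translate(src, num_return_sequences=n_candidates)

        # Keep the first candidate that compiles and passes every test.
        for cand in candidates:
            if run_tests(cand, tests):  # True only if all tests pass
                parallel_corpus.append((src, cand))
                break
    return parallel_corpus
```

The resulting corpus can then replace noisy back-translation pairs when fine-tuning the translation model, which is the mechanism the abstract credits for the reported error-rate reductions.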

Cite

Text

Roziere et al. "Leveraging Automated Unit Tests for Unsupervised Code Translation." International Conference on Learning Representations, 2022.

Markdown

[Roziere et al. "Leveraging Automated Unit Tests for Unsupervised Code Translation." International Conference on Learning Representations, 2022.](https://mlanthology.org/iclr/2022/roziere2022iclr-leveraging/)

BibTeX

@inproceedings{roziere2022iclr-leveraging,
  title     = {{Leveraging Automated Unit Tests for Unsupervised Code Translation}},
  author    = {Roziere, Baptiste and Zhang, Jie and Charton, Francois and Harman, Mark and Synnaeve, Gabriel and Lample, Guillaume},
  booktitle = {International Conference on Learning Representations},
  year      = {2022},
  url       = {https://mlanthology.org/iclr/2022/roziere2022iclr-leveraging/}
}