DOBF: A Deobfuscation Pre-Training Objective for Programming Languages

Abstract

Recent advances in self-supervised learning have dramatically improved the state of the art on a wide variety of tasks. However, research in language model pre-training has mostly focused on natural languages, and it is unclear whether models like BERT and its variants provide the best pre-training when applied to other modalities, such as source code. In this paper, we introduce a new pre-training objective, DOBF, that leverages the structural aspect of programming languages and pre-trains a model to recover the original version of obfuscated source code. We show that models pre-trained with DOBF significantly outperform existing approaches on multiple downstream tasks, providing relative improvements of up to 12.2% in unsupervised code translation, and 5.3% in natural language code search. Incidentally, we found that our pre-trained model is able to deobfuscate fully obfuscated source files, and to suggest descriptive variable names.
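To make the objective concrete, the sketch below illustrates what the obfuscation step could look like for a small Python function: identifiers are replaced with placeholder tokens, and the model is trained to recover the dictionary mapping placeholders back to the original names. The FUNC_i/VAR_i placeholder scheme follows the idea described in the abstract, but the AST-based renaming, the helper name obfuscate, and the example snippet are illustrative assumptions, not the authors' pipeline.

import ast
import re

def obfuscate(source: str) -> tuple[str, dict[str, str]]:
    """Replace identifiers with placeholder tokens and return the recovery dictionary.

    Minimal sketch: real obfuscation would also handle classes, attributes,
    scoping, and selective masking, which are omitted here.
    """
    tree = ast.parse(source)
    mapping: dict[str, str] = {}
    func_count = var_count = 0
    # First pass: function names -> FUNC_i
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name not in mapping:
            mapping[node.name] = f"FUNC_{func_count}"
            func_count += 1
    # Second pass: variables and arguments -> VAR_i
    for node in ast.walk(tree):
        name = None
        if isinstance(node, ast.Name):
            name = node.id
        elif isinstance(node, ast.arg):
            name = node.arg
        if name is not None and name not in mapping:
            mapping[name] = f"VAR_{var_count}"
            var_count += 1
    # Rewrite the source, matching whole identifiers only.
    obfuscated = source
    for original, placeholder in mapping.items():
        obfuscated = re.sub(rf"\b{re.escape(original)}\b", placeholder, obfuscated)
    return obfuscated, mapping

code = "def factorial(n):\n    return 1 if n <= 1 else n * factorial(n - 1)\n"
obfuscated_code, dictionary = obfuscate(code)
# obfuscated_code: "def FUNC_0(VAR_0):\n    return 1 if VAR_0 <= 1 else VAR_0 * FUNC_0(VAR_0 - 1)\n"
# dictionary:      {'factorial': 'FUNC_0', 'n': 'VAR_0'}

During pre-training, the model receives the obfuscated code as input and learns to predict the name dictionary, which is what allows a trained model to deobfuscate fully obfuscated files and to suggest descriptive variable names.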

Cite

Text

Lachaux et al. "DOBF: A Deobfuscation Pre-Training Objective for Programming Languages." Neural Information Processing Systems, 2021.

Markdown

[Lachaux et al. "DOBF: A Deobfuscation Pre-Training Objective for Programming Languages." Neural Information Processing Systems, 2021.](https://mlanthology.org/neurips/2021/lachaux2021neurips-dobf/)

BibTeX

@inproceedings{lachaux2021neurips-dobf,
  title     = {{DOBF: A Deobfuscation Pre-Training Objective for Programming Languages}},
  author    = {Lachaux, Marie-Anne and Roziere, Baptiste and Szafraniec, Marc and Lample, Guillaume},
  booktitle = {Neural Information Processing Systems},
  year      = {2021},
  url       = {https://mlanthology.org/neurips/2021/lachaux2021neurips-dobf/}
}