Open Vocabulary Learning on Source Code with a Graph-Structured Cache

Abstract

Machine learning models that take computer program source code as input typically use Natural Language Processing (NLP) techniques. However, a major challenge is that code is written using an open, rapidly changing vocabulary due to, e.g., the coinage of new variable and method names. Reasoning over such a vocabulary is not something for which most NLP methods are designed. We introduce a Graph-Structured Cache to address this problem; this cache contains a node for each new word the model encounters with edges connecting each word to its occurrences in the code. We find that combining this graph-structured cache strategy with recent Graph-Neural-Network-based models for supervised learning on code improves the models’ performance on a code completion task and a variable naming task — with over 100% relative improvement on the latter — at the cost of a moderate increase in computation time.
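The cache described above can be pictured as a small bipartite structure: one node per distinct word, connected by edges to the nodes in the code graph where that word occurs. The following is a minimal illustrative sketch of that idea in Python, not the authors' implementation; the class and method names are invented for exposition.

```python
from collections import defaultdict

class GraphStructuredCache:
    """Illustrative sketch of a graph-structured vocabulary cache.

    Each new word the model encounters (e.g. a freshly coined variable
    or method name) lazily gets its own cache node; edges link that word
    node to every node in the code graph where the word occurs. Because
    nodes are created on first use, the vocabulary stays open: no fixed
    word list is required.
    """

    def __init__(self):
        # word -> list of occurrence node ids in the program graph
        self.edges = defaultdict(list)

    def observe(self, word, occurrence_node_id):
        # First observation of a word implicitly creates its cache node;
        # subsequent observations just add more occurrence edges.
        self.edges[word].append(occurrence_node_id)

    def occurrences(self, word):
        # All code-graph nodes connected to this word's cache node.
        return self.edges[word]


# Hypothetical usage: "fooBar" appears at graph nodes 3 and 7, "baz" at 5.
cache = GraphStructuredCache()
cache.observe("fooBar", 3)
cache.observe("fooBar", 7)
cache.observe("baz", 5)
```

In the paper's setting, a graph neural network would then pass messages along these word-to-occurrence edges, letting information flow between uses of the same out-of-vocabulary identifier.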

Cite

Text

Cvitkovic et al. "Open Vocabulary Learning on Source Code with a Graph-Structured Cache." International Conference on Machine Learning, 2019.

Markdown

[Cvitkovic et al. "Open Vocabulary Learning on Source Code with a Graph-Structured Cache." International Conference on Machine Learning, 2019.](https://mlanthology.org/icml/2019/cvitkovic2019icml-open/)

BibTeX

@inproceedings{cvitkovic2019icml-open,
  title     = {{Open Vocabulary Learning on Source Code with a Graph-Structured Cache}},
  author    = {Cvitkovic, Milan and Singh, Badal and Anandkumar, Animashree},
  booktitle = {International Conference on Machine Learning},
  year      = {2019},
  pages     = {1475--1485},
  volume    = {97},
  url       = {https://mlanthology.org/icml/2019/cvitkovic2019icml-open/}
}